Foreword



Database research is a field of computer science where theory meets applications. 
Many concepts and methods that were regarded as issues of theoretical interest
when initially proposed are now included in implemented database systems
and related products. Examples abound in the fields of database design, query 
languages, query optimization, concurrency control, statistical databases, and 
many others. 

The papers contained in this volume were presented at ICDT’99, the 7th In- 
ternational Conference on Database Theory, in Jerusalem, Israel, January 10-12, 
1999. ICDT is an international forum for research on the principles of database 
systems. It is a biennial conference, and has a tradition of being held in beauti- 
ful European sites: Rome in 1986, Bruges in 1988, Paris in 1990, Berlin in 1992, 
Prague in 1995, and Delphi in 1997. Since 1992, ICDT has been merged with
another series of conferences on theoretical aspects of database systems, the
Symposium on Mathematical Fundamentals of Database Systems (MFDBS),
which was initiated in Dresden (1987) and continued in Visegrad (1989) and
Rostock (1991). ICDT aims to enhance the exchange of ideas and cooperation
in database research both within unified Europe, and between Europe and the 
other continents. 

ICDT’99 was organized in cooperation with: 

ACM Special Interest Group on Management of Data (SIGMOD)

IEEE Israel Chapter 

ILA — The Israel Association for Information Processing 
EDBT Foundation 

ICDT’99 was sponsored by: 

The Hebrew University of Jerusalem 
Tel Aviv University 

Tandem Labs Israel, a Compaq Company 



This volume contains 26 technical papers selected from 89 submissions. In 
addition to the technical papers, the conference featured two invited presen- 
tations: Issues Raised by Three Years of Developing PJama: An Orthogonally 
Persistent Platform for Java™ by Malcolm Atkinson and Mick Jordan, and
Novel Computational Approaches to Information Retrieval and Data Mining by 
Christos H. Papadimitriou. The conference also featured a state-of-the-art tuto- 
rial on Description Logics and Their Relationships with Databases by Maurizio 
Lenzerini. 
The conference organization committee consisted of Catriel Beeri and Tova
Milo. Administrative support was provided by Dan Knassim, Ltd, of Ramat- 
Gan, Israel. 

We wish to thank all the authors who submitted papers for consideration, 
the program committee members, and the external reviewers for their efforts 
and for their care in evaluating the submitted papers. We also wish to thank the 
organizing committee and the staff of Dan Knassim for their efforts with the local 
organization of the conference and Hartmut Liefke for his help in organizing the 
submissions and reviews. Last, but not least, we wish to express our appreciation 
and gratitude to the sponsoring organizations for their assistance and support. 

Jerusalem, January 1999
Catriel Beeri and Peter Buneman
Program Co-Chairs
Issues Raised by Three Years of Developing PJama:
An Orthogonally Persistent Platform for Java™

Malcolm Atkinson and Mick Jordan* 

University of Glasgow, Glasgow G12 8QQ, Scotland and Sun Microsystems 
Laboratories, 901 San Antonio Road MS MTV29-110, Palo Alto, CA 94303 



Abstract. Orthogonal persistence is based on three principles that have 
been understood for nearly 20 years. PJama is a publicly available
prototype of a Java platform that supports orthogonal persistence. It is 
already capable of supporting substantial applications. 

The experience of applying the principles of orthogonal persistence to 
the Java programming language is described in the context of PJama. 
For example, issues arise over achieving orthogonality when there are 
classes that have a special relationship with the Java Virtual Machine. 
The treatment of static variables and the definition of reachability for 
classes and the handling of the keyword transient also pose design prob- 
lems. The model for checkpointing the state of a computation, including 
live threads, is analyzed and related to a transactional approach. The 
problem of dealing with state that is external to the PJama environ- 
ment is explained and the solutions outlined. The difficult problem of 
system evolution is identified as a major barrier to deploying orthogonal 
persistence for the Java language. 

The predominant focus is on semantic issues, but with concern for rea- 
sonably efficient implementation. We take the opportunity throughout 
the paper and in the conclusions to identify directions for further work. 



1 Introduction 

In three short years the Java programming language and associated class libraries,
collectively referred to as the Java platform, have achieved an unprecedented
degree of acceptance and adoption by academia and industry. The Java
language provides the programmer with a simple but powerful object model, 
a strong type system, automatic storage management and concurrency through 
lightweight threads. Within the closed world of an executing Java program, these 
properties are extremely helpful, vital even, in the timely development of reliable 
and robust applications. However, there is no satisfactory way of maintaining 
these properties beyond the single execution of a Java virtual machine. Instead 
the programmer must deal explicitly with saving the state of an application, 

* Java and all Java-based trademarks and logos are trademarks or registered trade- 
marks of Sun Microsystems, Inc. in the United States and other countries. 



Catriel Beeri, Peter Buneman (Eds.): ICDT’99, LNCS 1540, pp. 1-21, 1998.
© Springer-Verlag Berlin Heidelberg 1998
using some combination of a variety of persistence mechanisms, for example,
file input/output, object serialization, relational database connectivity, none of 
which provide complete support for the full computational model. This lack of 
completeness, while only a minor nuisance for simple applications, becomes a 
serious problem as application complexity increases. 

Mindful of the limitations of the traditional approaches to persistence, the 
Forest project at Sun Microsystems Laboratories initiated a collaborative research
project with the Persistence and Distribution Group at Glasgow University
to apply the principles of orthogonal persistence |B| to the Java language.
This project, PJama¹, began shortly after the first public release of the Java
platform in May 1995. It has been running in parallel with the mainstream development
of the Java language, and a series of prototypes has been developed².
While these prototypes have validated the principles of orthogonal persistence, 
there remain a number of holes in the completeness property, and a number of 
issues that are still under debate. In this paper we use the term OPJ to denote 
the abstract notion of orthogonal persistence for the Java language, and PJama 
for our specific prototype implementations. 

Over the past three years there have been some significant additions to the 
Java language, and substantial additions to the set of standard class libraries. 
In addition, many other class libraries have been added as standard extensions, 
and many third-party libraries have become available. Inevitably some of these 
changes have implications for orthogonal persistence. Most significantly, certain 
idioms that have become widespread in the class libraries are either at odds 
with orthogonal persistence or at least would not be required were orthogonal 
persistence a standard part of the platform. 

Meanwhile, the phenomenal success of the Java platform has caused many 
object database vendors to develop systems for it. Some of these systems con- 
form quite closely to the principles of orthogonal persistence, but are shaped 
by the object database viewpoint, primarily the model proposed by the Object 
Database Management Group (ODMG). Specifically, these systems all require
that access to persistent objects is through a transaction interface. There
are a number of unresolved issues in the relationship between the transaction 
approach and the Java language semantics, which we discuss in section 0 

This paper is a sequel to one that described our very early experiences with 
the first PJama prototype m and is based on HH|. We begin with a brief 
discussion of the standard support for persistence in the Java platform, and 
then review the principles of orthogonal persistence, and their application to the 
Java language. We then discuss the model of checkpointing the computational 
state and its relationship to the transactional approach. This is followed by a 
detailed discussion of the problem of handling state that is external to the Java 
environment, which is a small but significant issue for the OPJ programmer. We 



¹ We originally used the name PJava, but this was taken as the trademark for Personal
Java.



^ The initial design and motivation is described in m- Some implementation issues 
are described in ESinEDl. 







then briefly discuss the difficult problem of class and platform evolution in OPJ. 
Finally, we review performance issues before drawing conclusions. The reader is 
assumed to have a working knowledge of the Java language and the core class 
libraries. A number of issues requiring further research have been highlighted in 
the main text. 



2 Persistence in the Java Platform 

The fact that the Java language contains the keyword transient suggests the 
designers gave some thought to persistence. However the first version of the Java 
language specification, JLS 1.0 eg, only provides a hint as to how transient
might be interpreted in a future version of the specification: 

“Variables may be marked transient to indicate that they are not part
of the persistent state of an object. If an instance of the class Point:

    class Point {
        int x, y;
        transient float rho, theta;
    }

were saved to storage by a system service, then only the fields x and y 
would be saved. This specification does not yet specify details of such 
services; we intend to provide them in a future version of this specifica- 
tion.” 

As we shall discuss later lh.4| the specification has now been defined, unfortunately 
in a way that is incompatible with orthogonal persistence. 

In the initial release of the Java language, the basic form of persistence 
was a traditional programming-language mechanism, namely encoding and de- 
coding basic types to and from output and input streams that can be con- 
nected to some external device, such as a file system or a network socket. 
Built on this was an ad hoc scheme for encoding and decoding a property table 
(java.util.Properties) to and from a stream.

Shortly afterwards the Java Database Connectivity (JDBC™) API was
added to provide a standard way to communicate with a relational database. 
While providing access to existing relational data, JDBC can be used to save 
Java object state, and several object-relational mapping systems are being
developed that automate this process, which is very tedious to program
manually.

With the JDK 1.1 release of the Java platform came the support for Java 
Object Serialization (JOS), a mechanism that supports the encoding and subse- 
quent decoding of nearly any object to and from a stream. Unlike the property 
table encoding, which is text-based, JOS uses a binary encoding format. JOS is 
effectively the default persistence mechanism for the Java platform and is used 
extensively in the JavaBeans™ framework m- It is also used as the argument
marshalling protocol for the Java Remote Method Invocation framework 12^. 
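
The explicit encode and decode step that JOS asks of the programmer can be sketched as follows. This is a minimal illustration that reuses the Point class from the JLS quotation; the names SerializationSketch and roundTrip are assumptions made for the example, and only the java.io serialization calls are real platform API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationSketch {
    // The Point class from the JLS quotation, made Serializable so JOS accepts it.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;                    // written to the stream
        transient float rho, theta;  // skipped by JOS

        Point(int x, int y) {
            this.x = x;
            this.y = y;
            this.rho = (float) Math.hypot(x, y);
            this.theta = (float) Math.atan2(y, x);
        }
    }

    // Hypothetical helper: encode an object to bytes, then decode a copy of it.
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point(3, 4);
        Point q = (Point) roundTrip(p);
        // x and y survive; the transient fields come back as 0.0.
        System.out.println(q.x + " " + q.y + " " + q.rho + " " + q.theta);
        // The decoded object is a copy, not the original object.
        System.out.println(p == q);
    }
}
```

Note that the decoded Point is a distinct object whose transient fields have been reset, so the application itself must re-establish any state or identity it depends on.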
2.1 Persistence and Application Development

The problems of dealing with persistent state have a profound impact on the 
overall approach that programmers bring to application development. In the 
absence of orthogonal persistence, every application must deal with converting 
the internal state into an appropriate external state and vice versa. It is believed 
that as much as 30% of the code of an application is occupied with this task. 

We observe several costs with this approach: 

— The application programmers encounter an additional conceptual load, which 
may include the need to understand two different models of the application 
domain, an additional technology and the interface mechanisms. 

— The “business” logic of the application may be obscured or distorted by the
mechanisms for managing persistence; this often increases maintenance costs.

— The computation may be inefficient (with a long start-up and shut-down
latency) because application programmers, in an attempt to minimise the
previous two effects, adopt a traditional structure (described below).

The model of “input, process, output” that is encouraged by the lack of orthogonal
persistence is very deeply ingrained in the programming language community,
so deeply in fact that experienced programmers often have trouble understanding
the concept of orthogonal persistence and want to see the “API”.
In [fi3] Liedke observes that novice programmers easily absorbed the concepts 
of orthogonal persistence but that seasoned programmers continued initially to 
design mappings between “internal” and “external” data structures. 

It must be admitted that orthogonal persistence is an inherently language- 
based approach to data definition, in contrast to the file-format approach that 
so typifies the current computing landscape. Indeed the Java platform has itself 
contributed a multitude of new file formats. 

It is not clear whether this tension will ever be resolved. What is clear, however,
is that orthogonal persistence fully supports the long-term preservation of
the type safety and consistency of the Java language, whereas ad hoc persistence 
does not. Ultimately, if these values are deemed important, they must surely be 
propagated to the domain of persistent storage. 

Another popular structure for applications is: 

1. Run queries to extract data from various databases; 

2. Perform processing; 

3. Update one or more databases with the results; 

4. Repeat from 1 until no more to do. 

This form of application can utilise orthogonal persistence in one of two ways. 
It can be written using orthogonal persistence instead of the databases. The 
loop avoids much of the explicit extraction and all of the explicit update. Data 
translation and interworking between different persistent regimes is avoided. 
Alternatively, orthogonal persistence can be used to store intermediate results 
and parts of the data, while the basic loop also uses legacy systems in traditional 
databases. 
Problem 2.1. It is an open question how best to advise application designers and
programmers, so that they exploit the potential of orthogonal persistence in the 
context of legacy systems. 

2.2 The Principles of Orthogonal Persistence 

The concept of orthogonal persistence has been developed and refined over many 
years of research and development in the academic community P|. While the 
general principles are language independent, they must be tailored appropriately 
when applied to a specific language. 

Orthogonal persistence is defined by three principles: 

— Type Orthogonality: persistence is available for all data irrespective of type. 

— Transitive Persistence: the lifetime of all objects is determined by reachability 
from a designated set of root objects. 

— Persistence Independence: it is indistinguishable whether code is operating 
on short-lived or long-lived data. 

The notion of orthogonality is often referred to loosely as transparent persistence. 
Many systems claim transparency yet, on deeper investigation, the transparency 
does not fully embrace the spirit of orthogonal persistence. In particular, since 
there is some latitude in the application to a particular programming language, 
there is inevitably the opportunity for disagreement on exactly how the principles 
should be applied. 

3 Applying the Principles to the Java Language 

In this section, we discuss how the principles are applied to the Java language, 
using the PJama system as a base-line, since it specifically aims to provide 
orthogonal persistence for the Java platform. 

3.1 Type Orthogonality 

The Java language provides two kinds of type: primitive types, for example int, 
and reference types, namely: array types, class types and interface types. 
The values of a reference type are references to objects. All objects, including 
arrays, support the methods of class Object. Other than the issue of references 
to external state, discussed in section El there are few difficulties in making 
user-defined classes persistent. 

The principal problem with user-defined classes is the specification of the 
Object.hashCode method, which does not require that the returned value be 
consistent across different applications or between different executions of the 
same application. The default implementation is based on the memory address 
of the object, which guarantees inconsistency if the hashcode is recomputed each 
time the object is brought into memory at a different address. We believe that the 
correct model for OPJ is to consider a persistent store as embodying a suspended 
virtual machine, which means that each execution is the same application in the
sense of the JLS. Therefore, the value returned by Object.hashCode must be
consistent across such executions. This issue is covered in detail in [3]. We have
only recently modified PJama to maintain this invariant, which turns out to be 
important for some core classes. 
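
The difference between the default, identity-based hash code and a value that is stable across executions can be seen in a small sketch. The classes ByIdentity and ByContent are hypothetical, and the content-based override is shown only to illustrate the invariant; it is an application-level workaround, not the way PJama maintains consistency.

```java
class HashCodeSketch {
    // Inherits Object.hashCode: the value is tied to this particular object
    // within one virtual machine execution.
    static class ByIdentity {
        final String name;
        ByIdentity(String name) { this.name = name; }
    }

    // Derives its hash code from its state. String.hashCode is specified by a
    // formula, so the value is the same in every execution.
    static class ByContent {
        final String name;
        ByContent(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
            return o instanceof ByContent && ((ByContent) o).name.equals(name);
        }

        @Override public int hashCode() { return name.hashCode(); }
    }

    public static void main(String[] args) {
        ByIdentity a = new ByIdentity("k");
        // The default hashCode is the identity hash code.
        System.out.println(a.hashCode() == System.identityHashCode(a)); // true
        // Two equal ByContent objects always agree on their hash code.
        System.out.println(new ByContent("k").hashCode() == new ByContent("k").hashCode()); // true
    }
}
```

A hash table built over ByIdentity keys is only valid while every key keeps the hash code it had when it was inserted, which is why an identity-based value must be preserved when the table is made persistent.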

Problem 3.1. Is the requirement for equivalence with a suspended virtual ma- 
chine execution adequate to guarantee persistence independence? 

Classes Requiring Special Support. Some of the core classes, however, are
a much more challenging proposition for OPJ. The classes Class, Thread and 
Throwable are intimately connected with the language definition and require 
special support from the virtual machine. Every object has an associated class, 
denoted by a unique instance of the class Class and this binding cannot be 
changed by program action. Every program consists of a number of independent 
threads of control, denoted by instances of the class Thread. The normal control 
flow may be short circuited by the throw statement, which takes a Throwable 
value as its argument, to the body of a matching catch clause of a try statement. 
The programmer may treat instances of Class, Thread and Throwable in a first- 
class way, store their values in other variables, pass them as arguments and so 
on. Therefore, by extension, the principle of type orthogonality must apply to 
instances of these types. 

The Class Class. Although not fully enforced by the Java language, the model
of object-oriented programming is that a client accesses an object through an interface
that defines its abstract state and behavior. In practice this is realized by
a combination of instance variables in the object and methods that perform some 
computation. In general, the methods that access the state of the object may 
perform an arbitrary computation. In short, the methods defined in the class of 
an object play a crucial role in defining the object. There is no way that a user of 
the interface can separate the physical manifestation of the state in the instance 
variables from the code that uses it to provide the abstract state. Furthermore, 
once created, an object cannot change the binding of its methods. From this we 
can conclude that orthogonal persistence must maintain that binding in order 
to preserve the type consistency of the language. By extension this requires that 
the code of a class be made persistent along with its instances. We discuss the 
issues of class reachability in more detail in section |E1 



The Class Thread. The availability of user-defined threads as first-class objects
only serves to emphasize the issue of the persistence of execution state. Even a 
language without such facility has an implicit thread that carries the execution 
state. The fundamental issue is where computation resumes after a checkpoint³.



³ A checkpoint occurs when called explicitly by an application (using
PJStore.stabilizeAll() in PJama), on successful transaction commit and
on normal program termination.
This is a complex issue that is discussed in more detail in section lb. 41 Here
we merely observe that, since the language permits the programmer to treat 
instances of class Thread like any other object from the perspective of assignment 
compatibility, the programmer can construct object structures that effectively 
capture execution state⁴. Therefore, OPJ must support this correctly. This is
not a trivial matter, since thread state includes held locks in addition to the 
activation stack. 

Owing to the complexity of the thread implementation in the JDK 1.1 virtual 
machine, PJama does not currently support persistent threads. Removing this 
limitation is high on our priority list. 

The Class Exception. Exception objects are usually thrown, caught and then
discarded. However, it is possible to retain them and subsequently access their 
state, such as any message string, or the stack trace. JLS 1.0 does not define the 
format of the stack trace, but it does suggest there are limits to the duration of 
its validity. Experimentally, the information survives the stack being overwritten 
with subsequent activations. Currently, when an exception is made persistent in 
PJama, the stack trace information is lost on restart, although this could be 
fixed with some effort. 
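
A retained exception is an ordinary object that stays reachable after the catch clause completes, as the following sketch shows. The class and method names are assumptions made for the example; only Throwable.getMessage and Throwable.getStackTrace are platform API.

```java
import java.util.ArrayList;
import java.util.List;

class RetainedExceptionSketch {
    // Exceptions kept here remain reachable, so under OPJ they would be
    // candidates for promotion like any other object.
    static final List<Throwable> retained = new ArrayList<>();

    static void failingStep() {
        throw new IllegalStateException("step failed");
    }

    static void run() {
        try {
            failingStep();
        } catch (IllegalStateException e) {
            retained.add(e); // retained instead of discarded
        }
    }

    public static void main(String[] args) {
        run();
        Throwable t = retained.get(retained.size() - 1);
        // The state is still accessible long after the stack has unwound.
        System.out.println(t.getMessage());
        System.out.println(t.getStackTrace().length > 0);
    }
}
```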

Problem 3.2. We refer to the set of classes whose run-time state and behaviour is 
intimately connected with the implementation of a Java Virtual Machine (JVM) 
as the intrinsic classes. Is this a set determined only by the Java language
definition? Is it possible (and acceptably efficient) to define abstract reflection 
interfaces over these classes (and their instances) so that if JVMs support these 
interfaces, orthogonal persistence can be implemented without the implementers 
of persistence having to modify a JVM? That is, can completely orthogonal 
persistence be implemented in a JVM-neutral way if suitable reflection interfaces 
are added to the intrinsic classes? 

Static Variables. Unlike C++ or C, the Java language has no isolated global
variables, only variables declared static in a top-level class definition. There is 
one instance of such a static variable for every instance of the Class object⁵.
Although the specification does not mandate such an implementation, it is intu- 
itive to imagine static variables as if they were ordinary instance variables of the 
associated class’s Class object. One common use for static variables is for data 
that pertains to all instances of the class. If the data reachable from the static 
variables contributes to the invariants of the class, it is important that it become 
persistent if instances of the class become persistent, otherwise the invariants as- 
sumed by the class will not hold on restart. In short, type orthogonality requires 
that static variables be made persistent along with a class’s Class instance⁶.

⁴ For example, certain of the AWT classes do this.

⁵ An instance of a Class object is constructed each time a class is loaded. Any given
ClassLoader may only load a class once. However, an application may use multiple
ClassLoaders to obtain independent instances of a class within one execution.

⁶ Other aspects of static variables are discussed later in the paper.
PJama supports the persistence of static variables unless they are marked
transient. This sets PJama apart from all other known persistence solutions for 
the Java platform, which treat static variables as implicitly transient. 
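
What goes wrong when static variables are treated as implicitly transient can be sketched with Java Object Serialization, which does not write static fields. The Ticket class and its invariant are hypothetical, and resetting nextId by hand is an assumption that stands in for static initialization running again in a new virtual machine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class StaticStateSketch {
    // Class invariant: every Ticket has a distinct id, and nextId is larger
    // than every id handed out so far.
    static class Ticket implements Serializable {
        private static final long serialVersionUID = 1L;
        static int nextId = 0;     // static state, not written by JOS
        final int id = nextId++;   // instance state, written by JOS
    }

    static byte[] save(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Ticket.nextId = 0;
        Ticket first = new Ticket();             // id 0, nextId becomes 1
        byte[] saved = save(first);

        Ticket.nextId = 0;                       // simulated restart (assumption)
        Ticket restored = (Ticket) load(saved);  // id 0 comes back, nextId is untouched
        Ticket fresh = new Ticket();             // also receives id 0

        System.out.println(restored.id == fresh.id); // true: the invariant is broken
    }
}
```

If the static variable were made persistent together with the instances, nextId would still be 1 after the restart and the new Ticket would receive a fresh id.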

Problem 3.3. The correct combination of persistence with static variables needs 
to be established. 



3.2 Persistence by Reachability 

Transitive persistence is defined as persistence by reachability from a designated 
set of roots. Reachability is already used in the Java language to determine the 
life-time of objects, and persistence by reachability is an obvious extension of this 
construct. However, the issue is quite complex owing to the interaction between 
dynamic class loading and the reachability of classes. 
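
Persistence by reachability can be sketched over a toy object graph. The Node class and the reachable method are assumptions made for illustration; a real implementation would traverse arbitrary object fields and, as discussed in the rest of this section, classes as well.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class ReachabilitySketch {
    // A toy object with explicit outgoing references.
    static class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Returns every node reachable from the given roots. Objects are compared
    // by identity, and cycles are handled by the visited set.
    static Set<Node> reachable(Collection<Node> roots) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (seen.add(n)) {
                work.addAll(n.refs);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);
        b.refs.add(c);
        c.refs.add(a); // cycle back to the root
        Set<Node> live = reachable(Collections.singletonList(a));
        System.out.println(live.size());      // 3
        System.out.println(live.contains(d)); // false: d is not reachable from the root
    }
}
```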



Reachability in the Java Language. All Java objects are allocated on a heap
and the lifetime of an object is determined automatically by reachability from 
the set of live threads and the static variables of the set of loaded classes. There 
is a subtlety relating to the set of loaded classes that was only clarified recently, 
namely exactly how the reachability of a Class instance is defined. If only the 
thread stacks act as roots, then many Class instances might become unreach- 
able during execution, causing them to be garbage collected. However, later in 
the program execution a new instance of a garbage-collected class might be cre- 
ated, requiring the creation of a new Class instance. If class garbage collection 
is implemented as class unloading, which is desirable in certain environments, 
for example, browsers, then this may result in a class being loaded several times 
during program execution, with the attendant loss of state held in static vari- 
ables due to static initialization occurring on each load. Since static variables are 
the only truly global state in a Java program, this behavior is sometimes unde- 
sirable and arguably a violation of the language semantics. The issue is further 
complicated by the provision for user-defined class loaders in the Java language. 

The resolution of this problem states that a Class instance is unreachable 
if and only if its class loader is unreachable, and that the system class loader 
is always reachable. This effectively defines the (hidden) table of system-loaded 
classes, and their associated static variables, as an additional set of roots. 

This issue is significant for orthogonal persistence as it strengthens the ar- 
gument for maintaining the binding between an object and its Class instance, 
and for static variables to be persistent by default. 



Class Reachability. The requirement for persistence of instances of class Class,
including the associated behavior as specified by the method bodies, is unassail- 
able. However, because of the semantics of dynamic class loading in the Java 
language, it is still unclear how to treat the reachability that is implicit in the 
class definition, in particular in the field declarations, method signatures and 
method bodies. This implicit reachability is characterized by symbolic references
to classes that may not yet have been loaded, because no active use⁷ has
yet been made of them. For example, the class of a local variable of a method 
that has not been invoked at the time of a checkpoint may not have been loaded. 

An important issue is whether the transitive persistence principle should 
apply in this case. The case for answering “yes” is based on the belief that the
complete behavior of the class must be captured with the object, and that the
behavior of the named class contributes to this. The case for answering “no”
is partly based on pragmatic issues such as the cost of loading such classes 
at the checkpoint, partly on consistency with the normal behavior of the Java 
platform and partly on the fact that it is impossible in general to determine the 
complete set of reachable classes. This is because of the dynamic class naming 
provided by the Class.forName method, which takes a String value, and by
the reflection subsystem, which also allows references to classes named through 
String values. We argue that reference to a class name via a language identifier 
is strongly suggestive that reachability should be enforced, because the classfile
was generated by compiling the referencing class against a specific instance of 
the referenced class, whereas there is no direct connection in the case of dynamic 
class naming. 
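
The contrast between a reference through a language identifier and dynamic class naming can be sketched as follows. The method names are assumptions made for the example; Class.forName and the reflective constructor call are standard platform API.

```java
import java.util.ArrayList;

class DynamicNamingSketch {
    // Reference through a language identifier: the compiled class contains a
    // symbolic reference to java.util.ArrayList that resolution can follow.
    static Object viaIdentifier() {
        return new ArrayList<String>();
    }

    // Dynamic class naming: the class is chosen by a String value computed at
    // run time, so no symbolic reference to it exists in this class.
    static Object viaName(String simpleName) throws ReflectiveOperationException {
        Class<?> c = Class.forName("java.util." + simpleName);
        return c.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(viaIdentifier().getClass().getName());       // java.util.ArrayList
        System.out.println(viaName("LinkedList").getClass().getName()); // java.util.LinkedList
    }
}
```

Nothing in the compiled form of viaName reveals which classes it may name, which is why the complete set of reachable classes cannot be determined in general.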

The term “loading a class” is often used loosely to describe a sequence of dis- 
joint actions that may, in practice, be separated in time. The JLS separates the 
specification of class loading into several stages, namely loading, linking, and ini- 
tialization. Loading is the portion that generates a Class instance from a binary 
form of a class, typically a classfile. Linking is further divided into verification,
preparation, and resolution. Resolution is the process that corresponds to the 
notion of class reachability as it is required to resolve all symbolic references 
to other classes. A symbolic reference corresponds directly to a name used as an
identifier in the source code. The JLS explicitly permits an implementation to 
resolve classes eagerly, although it requires that any errors detected during load- 
ing that might be thrown as exceptions, are actually thrown at a point in the 
program where some action is taken that might cause the loading to occur. Ini- 
tialization is the process of executing static (class) variable initializers and static 
initialization blocks. Initialization is permitted to result only from an active use 
of the class, which is precisely defined in the JLS, but informally corresponds to 
instance creation, method invocation or assignment to a class variable. 
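
The separation between loading and initialization can be observed directly, as in the following sketch. The class names are assumptions made for the example; it relies on the three-argument Class.forName, whose second argument controls whether the class is initialized.

```java
import java.util.ArrayList;
import java.util.List;

class InitializationSketch {
    // Records initialization events. It lives outside Lazy so that reading it
    // is not itself an active use of Lazy.
    static final List<String> LOG = new ArrayList<>();

    static class Lazy {
        static int counter;
        static { LOG.add("Lazy initialized"); } // static initialization block
        static void touch() { counter++; }
    }

    public static void main(String[] args) throws Exception {
        // A class literal does not initialize the class it names.
        String name = Lazy.class.getName();

        // Load (and possibly link) the class, but do not initialize it.
        Class<?> c = Class.forName(name, false, InitializationSketch.class.getClassLoader());
        System.out.println(c == Lazy.class); // true
        System.out.println(LOG.isEmpty());   // true: no initializer has run yet

        // Invoking a static method is an active use and triggers initialization.
        Lazy.touch();
        System.out.println(LOG.size());      // 1
    }
}
```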

The PJama prototype was designed and implemented prior to the release of 
the JLS and treats initialization as an integral part of the loading process. That 
is, if a class is deemed reachable by the resolution phase it will be loaded, linked 
and initialized. Therefore, the PJama prototype violates the JLS.
This is rarely observable but occasionally causes messages related to initialization
failures to appear during a checkpoint⁸.

Note that OPJ must preserve the basic invariant of the class loading mecha- 
nism, which is that a class is loaded at most once in a given classloader. In the 

⁷ “Active use” is defined in the JLS.

⁸ We will remedy this fault.
following sections, therefore, the term “load and resolve” should be understood
to ignore already loaded classes. The term “promote” is used to denote the pro- 
cess of making any object or class persistent in the store. Similarly, promotion 
should be understood to ignore already promoted classes. 

In PJama we have considered the following approaches to handling class 
reachability: 

1. Use the reference virtual-machine class-loading algorithm. At a checkpoint 
promote those classes that are reachable from the persistent roots. As a result 
classes promoted to the persistent store might contain symbolic references 
to classes that had not been loaded at the time of the checkpoint. These 
references would remain until an active use of the referenced class occurred, 
during a later execution. Then the problem arises that the resolution might 
lead to a different (perhaps incompatible) version of the class than it would
have at the time of the initial promotion, or it might fail to find the class at 
all. Either case is a loss of referential integrity. 

2. As case 1 but eagerly load and resolve the transitive closure of referenced 
classes every time a class is loaded. This is the most extreme case permitted 
by the JLS, and would prevent unresolved references occurring in promoted 
classes, at the cost of increased execution time and an increased demand for 
heap space. 

3. Similar to case 2, but defer the determination of the transitive closure un- 
til a checkpoint operation occurs, and limit the resolution process to those 
classes reachable from the persistent roots. This delays the cost of loading 
the additional classes until the last possible moment before they are needed 
to ensure referential integrity in the store, but may add considerable la- 
tency to the checkpoint operation. In addition, to comply with the JLS, the 
checkpoint operation would have to be defined as causing arbitrary loading 
in order for the programmer to be expected to catch any exceptions that 
occurred during the process. 

4. Similar to case 3, but simply capture the class-files, as byte arrays in the 
persistent store, of those classes in the transitive closure that are not already 
loaded. That is, defer the actual loading, linking and initialization, until a 
subsequent active use of the class. Modify the class-loading algorithm to 
search this set before the normal class-searching algorithm. In effect this 
scheme captures part of the external file system in the persistent store. 

Cases three and four can be altered slightly by defining reachability to be based 
on all classes that are currently loaded. This variant is called “conservative” by 
analogy with conservative garbage collection, because it may load more classes 
than strictly necessary. PJama currently uses the conservative variant of case 
three, but we now believe that case four would be a better choice, as it provides 
essentially the same consistency properties but at reduced cost at checkpoint 
time. 
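
Case four can be illustrated with a small class loader. This is a minimal sketch, not PJama's implementation: the class name CapturedClassLoader, its capture method, and the in-memory map that stands in for the persistent store are all assumptions made for illustration. It only shows the ordering described above, in which the captured class files are searched before the normal class-searching algorithm, and definition is deferred until a class is first requested.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "case four": class files captured as byte arrays
// are consulted before the normal class-searching algorithm.
final class CapturedClassLoader extends ClassLoader {
    // Stand-in for the set of class files held in the persistent store.
    private final Map<String, byte[]> captured = new ConcurrentHashMap<>();

    CapturedClassLoader(ClassLoader parent) {
        super(parent);
    }

    // Record a class file without loading, linking or initializing it.
    void capture(String className, byte[] classFile) {
        captured.put(className, classFile.clone());
    }

    boolean isCaptured(String className) {
        return captured.containsKey(className);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // A class is defined at most once in a given loader.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                byte[] bytes = captured.get(name);
                if (bytes != null) {
                    // Deferred definition: happens only on the first request.
                    c = defineClass(name, bytes, 0, bytes.length);
                }
            }
            if (c == null) {
                // Not captured: fall back to the normal search.
                return super.loadClass(name, resolve);
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

Because the captured copy is consulted first, a revised class file that later appears on the class path would not replace the preserved version in this sketch.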

Note that none of the above cases deal with references to classes through 
dynamic class naming, for example Class.forName(someString). If the 
programmer wishes to capture the set of classes that someString might name, then 
the programmer must arrange for such classes to be statically reachable (named 
explicitly as identifiers in the source code). Note that with OPJ, this can be 
done separately from the code that uses dynamic naming, for example, in an 
application-initialization class that is executed simply for the effect of forcing 
the classes to load and be made persistent. 
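
The arrangement described above might look as follows. This is a sketch under stated assumptions: CsvExporter, XmlExporter, ExporterRoots and ExporterFactory are hypothetical names, and whether naming a class through a class literal is sufficient to make it persistent depends on the reachability rule that the persistent platform applies.

```java
// Hypothetical classes that the application only ever names dynamically.
class CsvExporter {
    String format() { return "csv"; }
}

class XmlExporter {
    String format() { return "xml"; }
}

// Application-initialization class: the classes are named explicitly as
// identifiers, so they are statically reachable from this class.
final class ExporterRoots {
    static final Class<?>[] NAMED = { CsvExporter.class, XmlExporter.class };

    private ExporterRoots() {}
}

// The code that uses dynamic naming is kept separate from the roots above.
final class ExporterFactory {
    static Object create(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    private ExporterFactory() {}
}
```

The factory never mentions the exporter classes by identifier; only ExporterRoots does, which is the separation the text refers to.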

Problem 3.4. A precise definition of persistence by reachability is required which 
takes into account the need to make behaviour completely persistent by preserving 
an appropriate set of classes. 

Problem 3.5. Given the previous definition, static initialization must occur ex- 
actly once at an appropriate time. That is, on the first active use of the class, in 
any execution against the persistent store. 

3.3 Explicit Persistent Roots 

When applying the principle of transitive persistence, it is possible to allow 
the programmer to define the roots explicitly, rather than simply use the roots 
defined by the Java language semantics. If threads cannot be made persistent, 
as is the case in many current systems, this is a pragmatic choice, as the only 
other option is to use the static variables in the set of loaded classes. Given the 
historical uncertainty over the semantics of class reachability, coupled with the 
fact that in the initial version of the Java language a static variable could not 
be marked transient, an explicit, user-programmed root table was the obvious 
way to proceed, and this is the current mechanism that is used by PJama. 

Unfortunately, the existence of this table and the associated interface in- 
evitably violates the third principle of persistence independence. Arguably, the 
explicit root table should be dropped in favor of the standard reachability rules 
of the Java language. However, because this would make all core classes implic- 
itly persistent, it would require a complete solution to the issue of external state, 
described in detail in section 5. It would also make conservative class loading the 
default, as all loaded classes would be reachable, by definition. This issue seems 
to hinge primarily on the level of orthogonality of the system as a whole and on 
the quality of the evolution tools (see section 0 ). 

Problem 3.6. The precise definition of persistence independence and persistence 
by reachability must encompass a mutually consistent definition of the persistent 
roots. 

3.4 Persistence Independence 

The purpose of this principle is to support the re-use of software irrespective 
of object lifetime. The fundamental requirement of persistence independence 

® There are subtleties due to the possibility of multiple ClassLoaders. For example, if 
a second attempt is made to load a class of the same name using a different loader, 
and there is a preserved copy in the persistent store and a revised copy in the file 
space identified by the CLASSPATH which variant should be used? 
is that source code should not have to be modified in order for a class to become 
persistent. Typically required modifications include inheriting from special 
classes, requiring specific constructors, requiring explicit method calls to transfer 
an object from persistent storage and so on. 

The Java platform defines a public format for compiled classes [23]. This 
permits the source form of a class to undergo arbitrary transformations before 
execution, provided that the resulting classfile defines a class of the same name 
and also that it passes the verification tests of the Java virtual machine. 

This has led to several systems that achieve a form of persistence indepen- 
dence by effectively rewriting the source code to insert the read and write barriers 
and other changes that are needed to support transparent persistence. 

It is debatable whether such systems meet the principle of persistence inde- 
pendence. Certainly the source code does not need to be modified, which is the 
first priority of this principle. However, the goal of re-use, which is the larger 
purpose of this principle, is compromised by classfile transformation. Typically, 
the transformed bytecodes cannot be used outside of the particular persistence 
framework, which seriously damages the “write once, run anywhere” goal of the 
Java platform. For example, the popular JGL collection classes are available in 
two versions, standard and transformed, for use with Object Design’s PSE system. 
This bifurcation would become intolerable if variants were needed for 
a wide variety of different implementations of persistence. 

In the general case, supporting a system of dynamically loaded classes that 
need post-processing requires bundling the pre-processor with the application 
and loading such classes with a customized class loader. Since this requires fore- 
sight on the part of the programmer it therefore violates the principle of persis- 
tence independence. It also rules out the core classes, which cannot be loaded 
by this mechanism. They may indeed be preloaded and therefore excluded from 
persistence via this implementation strategy. But they earn their presence in the 
core classes because they are frequently re-used. 

Finally, complete transparency would require that the transformation step be 
completely hidden from debuggers, performance monitoring tools, the reflection 
system and so on. Achieving this degree of transparency is quite unprecedented 
at the current state of the art. 

Problem 3.7. The principle of persistence independence requires careful definition 
so that it covers all relevant aspects of application programming without 
being overly restrictive. A good definition is essential for ensuring the re-use of 
classes across persistent and non-persistent systems. 

4 Controlling Checkpoints 

The principles of orthogonal persistence are not explicit about whether the 
programmer has any control over when the state of the computation becomes 
durable, so that it will survive a shutdown of the system, intended or involuntary. 
In a perfect implementation of orthogonal persistence in a closed world, 
this would not be an issue. Provided that the system checkpoints its state with 
a sufficiently fine grain, the only slight discontinuity would be after a crash, and 
even then the system would recover transparently and continue from the last 
successful checkpoint. 

In practice this idealized model is unrealistic, and must be modified to deal 
with external state that is defined outside the Java environment, but accessed 
within it. Handling external state is covered in detail in section 5. 

However, even in a closed world, it is necessary to address how the program- 
mer can control the checkpoint mechanism and to specify the special cases of 
shutdown and restart. 

The precise details of the Java virtual machine start-up are defined by the 
JLS to be implementation dependent. The general scheme is to load a specific 
named class and then invoke its main method, which has a special signature, in 
a thread. It is implementation and application dependent whether other threads 
may be created prior to the invocation of main. A virtual machine terminates 
its activity and exits when one of the following two things happens: 

— All the non-daemon threads terminate. 

— A thread invokes the System.exit method and the exit operation is not forbidden by the security manager. 

A daemon thread is defined by the fact that its existence does not influence the 
decision to exit the virtual machine. 

Let us consider the start-up of a fresh orthogonally persistent Java virtual 
machine and a completely empty store, that is, the store contains no threads 
and no loaded classes and no other object instances. Assume that a class Main 
is indicated as the entry class and assume also that no additional threads are 
created. There are three cases to consider: 

— The thread terminates normally without invoking a checkpoint. 

— The thread terminates abnormally without invoking a checkpoint. 

— The thread invokes a checkpoint, and then terminates abnormally. 

Abnormal termination is defined as an uncaught exception in the thread, or a 
call to System.exit with a non-zero argument. 

In the first case the thread will have caused the loading of some classes, often 
a very large number, and possibly created some persistent roots. We choose to 
interpret normal termination as requiring an implicit checkpoint. By definition, 
the terminated thread is not reachable, therefore the store will contain only 
passive objects and no persistent execution state. 

In case two, we choose to interpret the abnormal termination as indicating 
that no implicit checkpoint should occur. Therefore the store remains unchanged. 

In case three, the explicit checkpoint will find that the thread is active and 
reachable. Therefore the state of the thread should be made persistent in the 
store. The subsequent abnormal termination will, as in case two, avoid the im- 
plicit checkpoint. 
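
The three cases can be mimicked with a toy model. This is a sketch only: ToyStore and its methods are hypothetical and are not PJama's API, the store is an in-memory map, and persistent thread state (which case three would also preserve) is deliberately not modelled. The sketch shows only which updates become durable in each case.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the checkpoint rules; not an implementation of OPJ.
final class ToyStore {
    private Map<String, Object> durable = new HashMap<>();       // state in the store
    private final Map<String, Object> working = new HashMap<>(); // state in the VM

    void put(String root, Object value) {
        working.put(root, value);
    }

    // Explicit checkpoint: the current state becomes durable.
    void checkpoint() {
        durable = new HashMap<>(working);
    }

    Map<String, Object> durableState() {
        return new HashMap<>(durable);
    }

    // Runs one activation of the virtual machine against the store.
    // Returns true on normal termination, false on abnormal termination.
    boolean runActivation(Runnable entry) {
        working.clear();
        working.putAll(durable);          // continue from the last checkpoint
        try {
            entry.run();
        } catch (RuntimeException abnormal) {
            return false;                 // cases two and three: no implicit checkpoint
        }
        checkpoint();                     // case one: implicit checkpoint
        return true;
    }
}
```

In case three, an update made before the explicit checkpoint survives the later abnormal termination, while an update made after it does not.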

Now consider the start-up of the virtual machine with the store in the state 
left by each of the above cases. In cases one and two the store contained no 
persistent threads and it is therefore necessary to create a new thread and also 
for the start-up process to be provided with the name of the entry class. This 
class may or may not already exist in the persistent store; if not it will be loaded. 

Case three is more interesting. Since the persistent store contains a live 
thread, there is no need to create a new thread and also no need to specify 
an entry class. The virtual machine can simply resume the suspended thread. 
But what if an entry class is specified? This situation is interpreted as indicating 
that an additional thread should be created to invoke the given method, caus- 
ing a total of two live threads. This case generalizes to a store that contains an 
arbitrary number of live threads. 

The astute reader will have noted that in case three, the live thread in the 
store may resume and then terminate abnormally, as it did during the first 
activation. This is an example of a persistent bug, a known phenomenon of 
persistent systems 1231 . However, if a new entry class is specified and executes 
in a fresh thread, this will not matter, as the new thread has the opportunity 
to alter the behavior of the system. Another mechanism for handling this case 
is outlined in section 

Problem 4.1. The semantics and utility of persistent active threads require further 
exploration. Their interaction with JVM start-up and shutdown requires a 
careful definition. 



4.1 Transactional Interpretation 

The decision to interpret successful termination as an implicit checkpoint and 
abnormal termination as abort is similar to the commit and abort operations of 
a transaction. Indeed one can consider one execution of the JVM as a single 
flat transaction. In transactional terms the explicit checkpoint is like a chain 
transaction, which is a transaction that commits periodically, relinquishing the 
ability to abort those changes, but does not release control of the resources it 
is using. In the OPJ model, the virtual machine is the sole owner of all the 
resources and so a checkpoint is equivalent to a chain transaction. In particular 
an explicitly invoked checkpoint cannot be undone. 

Although it is possible to construct an application system as a set of method 
invocations on an OPJ virtual machine, each acting as an ACID transaction, this 
approach has some drawbacks. First it assumes an external environment that is 
capable of activating the virtual machines, and so is not an appropriate model 
for some applications. Second, there is inevitably some latency and overhead in 
launching and closing down a virtual machine. 

In languages that do not support concurrent threads, concurrency has to be 
simulated with multiple processes (virtual machines). This model is common, for 
example, in many object-oriented database systems, and has been carried over 
into their Java variants. However, since the Java language provides concurrency 
through lightweight threads, the multiple-process model is strictly unnecessary. 
A single JVM can support a range of independent activities, running as separate 
(groups of) threads, and, if the orthogonality of the system is complete, it will not 
matter when and by which threads checkpoints are initiated. However, all these 
threads are running in a shared object space and it is important that they do 
not accidentally interfere with one another. The Java language supports this with 
explicit object locking through the use of synchronized methods and blocks. 
This puts the onus on the programmer to get the locking correct, a task that is 
known to be difficult and error prone. In particular, synchronized methods 
do not provide effective support for coordinating sequences of operations on 
a group of objects, owing to the inability to recover from deadlock caused by 
acquiring individual locks in an incorrect order. 
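
To make the burden on the programmer concrete, the following sketch shows the usual manual remedy, a fixed global lock order. It is a well-known convention and is not taken from PJama; the Account class and its id field are hypothetical, and the ids are assumed to be unique. Two threads transferring in opposite directions then always acquire the two monitors in the same order, so neither can hold one lock while waiting for the other.

```java
// Hypothetical example of coordinating an operation on a group of objects
// with synchronized blocks and a programmer-imposed lock order.
final class Account {
    final int id;            // assumed unique; used only to order the locks
    private long balance;

    Account(int id, long balance) {
        this.id = id;
        this.balance = balance;
    }

    synchronized long balance() {
        return balance;
    }

    static void transfer(Account from, Account to, long amount) {
        // Always lock the account with the smaller id first.
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                if (from.balance < amount) {
                    throw new IllegalArgumentException("insufficient funds");
                }
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}
```

Nothing in the language enforces this discipline: a single method elsewhere that nests the locks the other way round reintroduces the possibility of deadlock, and there is no way to recover from it.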

Transactions, which originated in the database field, are explicitly designed 
to support an isolated, recoverable unit of work that consists of multiple op- 
erations. To ensure consistency, the transaction approach uses implicit object 
locking, holding locks until the transaction commits or aborts, thus ensuring 
an equivalent serial ordering exists. A key feature of the transaction approach 
is that the locking is dynamic and orthogonal to other properties of an object, 
whereas synchronized methods are a static, compile-time property. 

This difference in the locking model is enough to suggest that adding transac- 
tions to the Java language is a significant change, and effectively defines a variant 
of the language. Another aspect of the transactional approach that significantly 
changes the programming model is the ability to abort (undo) part of a com- 
putation. This is a very powerful mechanism that comes at negligible additional 
cost, since it is already required to support the basic properties of atomicity and 
durability. However, we should note that extending this capability orthogonally 
to short-lived (transient) objects is not trivial for an implementation. Indeed, at 
the time of writing, none of the commercially available transaction systems for 
the Java platform undo transient state on a rollback or abort, and this limitation 
leads to many subtle bugs in applications involving transient state. 

In parallel with the basic provision of orthogonal persistence for Java, the 
PJama project has from the outset also included research on a sophisticated 
extensible transaction model, although this has not yet been implemented 
in any of the publicly available prototypes. However, whereas all commercial vendors 
of persistence solutions for the Java platform only provide persistence in a 
transactional framework, we believe that the provision of complete orthogonal 
persistence for the Java language is independent and more fundamental than a 
transaction mechanism. 

Despite the potential pitfalls, it is a fact that the Java language already 
supports a highly customizable concurrency control model, which is widely used 
in existing code, particularly in sophisticated applications which could not be 
constructed with flat ACID transactions. Adding orthogonal persistence to the 
Java platform opens up new territory by allowing the concurrency model to be 
fully exploited in long-lived applications. It seems to us that this territory should 
be explored in parallel with the development of transaction models. 

It may turn out that the power and orthogonal properties of an extensible 
transaction model would be a significant addition to the Java language, but its 
adoption may require abandoning, or severely curtailing, its current approach to 
concurrency control. In this context, it is important to understand a fundamental 
difference in approach between the PJama extensible transaction model and 
that of the current crop of commercial systems. The latter is essentially aimed 
at controlled access to an external object space by multiple instances of a JVM, 
whereas the former is concerned with controlled access by multiple threads to the 
object space defined by a single virtual machine. In this sense the PJama exten- 
sible transaction model is more compatible with the Java language specification 
as it is designed to operate in a similar closed-world environment. 

There are many open issues concerned with the relationship between the 
checkpoint model of orthogonal persistence and the transaction model, some of 
which are discussed in 0. Regardless of how these issues are resolved, it seems 
likely that the basic principles of orthogonal persistence will play an important 
role in any transaction system. In particular, the much sought after feature of 
a “long-running transaction” can obviously benefit from complete orthogonal 
persistence. 

Problem 4.2. There is a rich space of possible combinations of orthogonal persistence 
and transactions. Their combination with the Java language’s concurrency 
model is not straightforward. There are both semantic and practical problems 
deserving further investigation here. 



5 Handling External State 

Almost all of the problems that application developers experience with orthog- 
onal persistence are caused by the need to deal, explicitly or implicitly, with 
external state that is not under the control of the system. Even when an or- 
thogonally persistent application supports a rich and long-lived set of objects, 
there will be times when it is necessary to communicate with external systems, 
some of which will not be written in the Java language. In fact, because of the 
safety properties of the Java language, part of the implementation of the runtime 
system must be written in another language. 



5.1 Incompatible Models of External State 

A general model for dealing with external state is to treat the application’s ac- 
cumulated information about external state as a cache that is discarded when a 
system shuts down, either purposely or due to a crash, and needs to be recon- 
structed when the system restarts. This will sometimes require saving the values 
representing the external state in a different form and using this to re-establish 
the active state during restart. For example, the operating system descriptor FD 
of an open file with pathname PN will not be valid across restarts, but PN can 
usually be used to re-open the file, returning an equivalent descriptor FDnew. 
Persistence at the operating-system level attempts to reduce the 
external state to an absolute minimum. Network communications still present a 
problem, however. 

Note that managing external state of this sort requires the programmer to 
do extra work. That is, the code cannot be persistence independent. In fact, this 
is an example where supporting orthogonal persistence requires the application 
programmer to do more work than in a transient program. Using the open 
file as an example, it is likely that the name of the file to open is passed in 
as an argument to a method of the class that holds the file descriptor. In a 
transient application this execution path will always be followed, which will 
cause the file descriptor to be initialized each time, before it is used. However, 
an orthogonally persistent application logically continues from the point that it 
reached before the last checkpoint, and so will not usually retrace this execution 
path. Therefore, unless the programmer takes extra steps to ensure the validity 
of the file descriptor value, by re-opening the file during restart, the application 
will fail when it tries to use the FD value. 

This is an example of a fundamental incompatibility between the program- 
ming model of orthogonal persistence and the traditional transient programming 
model. The incompatibility arises from the (natural) position taken in OPJ that 
variables are persistent by default, which clashes with the reality that in a tran- 
sient program all variables are re-initialized on each run. Unless the programmer 
is aware of the possibility that the code might run in an orthogonally persistent 
environment, it is easy to forget to handle the subset of variables that are truly 
transient because they deal with external state. External state may change be- 
tween activations of the OPJ system since it is managed independently. Indeed 
the reason that it is desirable for it to remain external is to take advantage of 
these other, pre-existing, systems. 

This conflict between the transient and persistent programming models is 
visible in many places in the core-class libraries. A classic example is the standard 
technique for loading native-code libraries, which assumes that it is safe to do 
this in a static class initializer. However, in an OPJ system, a class is initialized 
exactly once, so on a second invocation of the program, the library is not loaded. 
There are many other instances of variables that hold values that are intrinsically 
transient, but they are not distinguished in the source code. 
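
The native-library idiom and one possible repair can be sketched as follows. The Codec class, its loads counter and the onRestart hook are hypothetical; the actual library load is replaced by a counter so that the example runs without native code. The sketch assumes that the platform offers some way to run code when a suspended execution resumes, which is an assumption and not a feature of the standard Java platform.

```java
// Hypothetical class that depends on a native library.
final class Codec {
    static int loads;                      // counts simulated library loads
    private static boolean libraryReady;   // intrinsically transient information

    static {
        // Traditional idiom: load the library when the class is initialized.
        // In an OPJ system this runs once per store, not once per activation.
        ensureLibrary();
    }

    static synchronized void ensureLibrary() {
        if (!libraryReady) {
            loads++;                       // System.loadLibrary(...) would go here
            libraryReady = true;
        }
    }

    // Assumed restart hook: discards the stale flag and loads again.
    static synchronized void onRestart() {
        libraryReady = false;
        ensureLibrary();
    }

    static int encode(int value) {
        ensureLibrary();
        return value ^ 0x5A;
    }

    private Codec() {}
}
```

Within one activation the library is loaded once however often the class is used; only the explicit restart step loads it again, which is exactly the step that a static initializer alone cannot provide.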

This is particularly unfortunate because the Java language provides a key- 
word to denote that a value should not be persistent, namely the transient 
modifier. In many cases, all that the programmer needs to do is to mark the 
variable as transient and then handle the restart case by checking for a distin- 
guished value, typically null. Fortunately, this issue also applies to Java Object 
Serialization (JOS), which means that quite a lot of existing code that deals with 
external state already works correctly when used in the normal transient Java 
platform or in the OPJ platform. However, the solution is not complete because 
serialization is itself not complete. Some classes cannot be serialized, and static 
variables are assumed by serialization to be transient. Often it is precisely these 
classes and variables that deal with external state. Sadly, as explained in detail 
in the transient modifier has now been specified in the latest version of the 
JLS in a way that is incompatible with its natural application for orthogonal 
persistence. 

One other variant of external state that can cause incompatibilities with 
OPJ, is that which holds the context in which the application (virtual machine) 
is assumed to be executing. A classic example of this is the notion of a locale, 
which defines the human language with which the application should communi- 
cate with a user. 

This differs from the previous kind of external state in an important way, 
namely that it is usually not the case that the internal form of the state is in- 
validated by a shutdown and subsequent restart. Rather the external context 
determines the particular set of objects that will be used (instantiated). Cur- 
rently, in the Java platform, this kind of contextual choice is often achieved using 
the mechanism of dynamic class loading. For example, the locale (internation- 
alization) support defines an elaborate algorithm for choosing which classes to 
load based on external values in the environment. 

In general, this programming model is incompatible with OPJ, again for the 
reason that classes are not reloaded when a suspended execution resumes. This 
can be handled in the same manner as other external state, namely through use 
of the transient modifier and action handlers (see below). However it is not 
entirely clear that this information should always be transient. For example, if a 
user initiates an application in one physical locale, but resumes its execution in 
another, he/she probably does not want the application to change its behavior 
with respect to the language it uses to communicate. On the other hand if the 
user delegates the task to a co-worker whose native language is different, it would 
be appropriate to change the locale, even while running on the same machine. 

The current mechanisms in the Java platform cannot capture this kind of con- 
textual change. A better solution would be to preserve the locale by default in 
the persistent store, and provide an explicit mechanism that is application-driven 
to change the locale explicitly. This would result in incrementally capturing the 
information pertaining to different locales. Note that the problem is not per se 
the use of the algorithm to identify the classes for a locale, it is the assumption 
in the Java programming model that classes are always loaded whenever an ap- 
plication resumes. This assumption itself is driven by the traditional separation 
of code and data, a notion that is fundamentally at odds with object-oriented 
programming. 

Finally, the exact behavior of many of the class libraries is dependent on 
external property files as defined by the class java.util.Properties. In OPJ 
one would expect that much of this configuration data would be stored and 
manipulated directly in class instances. 

5.2 Native Code 

The Java language semantics are mainly defined in a closed world that con- 
sists solely of Java entities. A major exception is the provision for a method 
to be annotated with the native modifier, which indicates that this method is 
implemented in some other language. 

The language definition makes no attempt to embrace external notions such 
as graphics devices, network adaptors, or generalized input and output. Instead 
this is left to standard class libraries, some of which make use of the native 
modifier to escape the constraints of the Java language, and to exploit legacy 
code written in other languages, typically C. 

The principles of orthogonal persistence apply straightforwardly to the closed-world 
part of the Java platform, but it is much less obvious how to deal with 
the native modifier and the state that exists outside the Java environment. It 
is evident that a complete solution would require the principles of orthogonal 
persistence to be applied to the language used to implement the native code. 
In general this is impossible because the principle of transitive persistence cannot 
be applied to a weakly typed language like C. However, we must mention 
the alternative approach taken by the persistent operating system community, 
which is essentially to define a machine-level semantic model in terms 
of tables of memory pages. This provides persistence by reachability of memory 
pages, but leaves it up to each language implementation how to map its semantic 
model to the memory-page abstraction. This can be challenging, for example, 
when considering the efficient implementation of garbage collection. There are 
other problems arising from the “page” versus “object” mismatch, in particular 
the handling of concurrency control. In addition, devices that do not map 
easily to the memory model, for example a network adaptor, still require special 
treatment. 

In summary, there really is no escape from the problem of external state, 
and every implementation of orthogonal persistence must choose how much ef- 
fort should be expended to achieve the ideal properties of the closed world. It 
must also provide interposition mechanisms, where they are appropriate, for 
application programs to define policy. 

5.3 Completeness 

A fundamental decision is where to draw the line in providing complete support 
for orthogonal persistence. Our experiences with PJama and its associated 
user community are that the standard platform, namely the virtual machine 
and the core classes, must support orthogonal persistence completely. Most of 
the problems that users have complained about are directly or indirectly a con- 
sequence of lack of orthogonality in the standard platform. The chief problem 
area is the Abstract Window Toolkit (AWT), which is heavily used in most Java 
applications. Unfortunately, AWT relies extensively on native code, including 
third-party libraries, in particular native window-system implementations. The 
recent release of the “Swing” user-interface components, which maintain 
their state as Java objects, has alleviated the problems, but Swing still relies on 
several fundamental AWT mechanisms that, in turn, depend on native code. 

The move to adopt a standard and more abstract interface to native code, 
embodied in the Java Native Interface (JNI), bodes well for handling arbitrary 
native code in an OPJ environment, because the required modifications can be 
achieved by modifying just the implementation of JNI itself. 

5.4 Mechanisms for Handling External State 

In principle the transient modifier could provide a basic mechanism for handling 
external state, if it is defined appropriately. The specification in JLS 1.0 is only 
suggestive of the behavior on output, or checkpoint in the case of OPJ, but it 
suggests that the persistent state of the object does not contain variables marked 
transient at all. The specification is silent on input or restart, but in keeping with 
the normal behavior of a transient program, we can surmise that a new object 
should be created and populated with the values of the persistent fields. From 
the OPJ perspective, however, a new object cannot be created because doing so 
would break the guarantees on object identity. Instead, we can simply arrange 
that any transient fields are set to a specific value. In any case, when readying 
the object for active use, it is not entirely obvious what the specific value for a 
field marked transient should be. There are several choices: 

1. the default value for the type of the variable. 

2. the value given by an initializing expression, if any. 

3. the value after executing the default constructor. 

4. the value of the field at the point when the last checkpoint occurred. 

The simplest is undoubtedly choice 1, and this is the choice made by JOS, which 
has become the persistence service alluded to in JLS 1.0. Choices 2 and 3 would 
more closely approximate the state when a fresh object is created with the new 
operator, but suffer from having to execute arbitrary user-code to effect the 
initialization. Choice 4 is equivalent to ignoring the transient modifier. 
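
The difference between choice 1 and choice 2 can be observed directly with JOS. The following is a minimal sketch, not taken from the paper: the class Gauge and its fields are hypothetical. A transient field with an initializing expression holds the initializer's value in an object made with new, but holds the default value of its type after deserialization, because no constructor or initializer of a serializable class is run.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class used only to illustrate choices 1 and 2.
class Gauge implements Serializable {
    private static final long serialVersionUID = 1L;

    int reading = 7;            // ordinary field, saved by serialization
    transient int cached = 42;  // initializer applies only to objects made with new

    // Serializes the object to memory and reads a copy back.
    static Gauge roundTrip(Gauge g) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(g);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Gauge) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A copy obtained from roundTrip keeps the value of reading, while cached is the int default rather than the initializer's value.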

When the object is active, the value of a transient field is part of the state 
of the object and presumably is a factor in any of its invariants. From this we 
can conclude that it will be necessary to re-establish the state (invariants) before 
methods of the object can execute correctly. In particular, the OPJ programming 
model requires that the state be re-established to a value equivalent to case 4. 

There are two techniques for re-establishing the state. The first is to direct 
all access to the field through a check for an uninitialized state, e.g.: 

    if (transientField == null) { 
        transientField = reEstablishState(); 
    } 



Since a checkpoint might be initiated by a separate thread after the check for 
uninitialized state, care must be taken not to assign an intermediate value to 
transientField, which might lead to an initialized but incorrect value on sub- 
sequent restart. 
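
The guard and the caution about intermediate values might be combined as in the sketch below. It is plain Java rather than PJama code, and the class ReportWriter, the helper reEstablishState and the use of a StringBuilder as a stand-in for external state are all assumptions made for illustration. The new value is built completely in a local variable and only then assigned, so the field is always either null or fully initialized.

```java
// Hypothetical sketch: lazy re-establishment of a transient field.
class ReportWriter {
    // Stand-in for external state; not part of the persistent state.
    private transient StringBuilder buffer;

    // All access to the field goes through this guard.
    private synchronized StringBuilder buffer() {
        if (buffer == null) {
            // Build the value completely in a local variable first ...
            StringBuilder fresh = reEstablishState();
            // ... then publish it with a single assignment, so the field
            // never holds a partially constructed value.
            buffer = fresh;
        }
        return buffer;
    }

    // Assumed helper that rebuilds the external state from scratch.
    private StringBuilder reEstablishState() {
        StringBuilder sb = new StringBuilder();
        sb.append("header;");
        return sb;
    }

    void write(String line) {
        buffer().append(line).append(';');
    }

    String contents() {
        return buffer().toString();
    }

    // Simulates the field holding its default value after a restart.
    synchronized void simulateRestart() {
        buffer = null;
    }
}
```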

This adds an overhead to all code that will access the field and is one reason 
why initializing expressions and constructors are useful, since they establish the 
state initially and so avoid the check. However, such checks offer a significant 
defence mechanism for objects that may be used with arbitrary and complex 
populations of other objects. Such sanity checks on critical state that may have 
been corrupted by independent action at least discover erroneous conditions and 
may lead to more robust applications. 

The second technique is to factor out the code that re-establishes the state 
into a designated restart method, and arrange for this to be invoked before 
there is a possibility for a client to invoke one of the object’s methods. Since 
this code might fail, and so raise an exception, or have arbitrary side-effects, 
there is an issue as to when this code is executed. In an OPJ system with a 
very large number of objects, it might be expensive and introduce unacceptable 
latency to process every object with a transient field on start-up, which suggests 
a lazy invocation scheme might be more appropriate. On the other hand it is not 
clear how the programmer is expected to deal with the apparently asynchronous 
invocation of the restart method. 

JOS handles this situation as part of its general mechanism for interposition 
on the serialization and deserialization of an object. Since serialization processes 
an entire object graph from one explicit call by the user, and carefully defines 
the order of calls to user-defined restart handlers, the asynchronous invocation 
problem does not arise. 

Since the OPJ model of persistence is implicit, a different approach is taken. 
This allows the programmer to adjust the state of the application on restart 
of the virtual machine, with the aim of re-establishing the state as closely as 
possible to that which pertained at the last successful checkpoint. In PJama this 
is achieved by action handlers, which are a generalization of the restart method, 
and also handle the checkpoint, shutdown, abort and recover situations. 

Action handlers are defined by an interface, PJActionHandler, that declares 
the methods onStabilization, onStartup, onShutdown, onAbort and 
onRecoveryStartup, which are called by the PJama system when one of the 
corresponding events occurs. Any class may implement this interface to manage 
its state. Handlers are executed synchronously and there is no system support 
for lazy, per-object invocation. Instead this must be programmed explicitly using 
transient fields and the guard check outlined earlier in this section. 

Action handlers have been used in the PJama system to deal with several of 
the core classes that manage external state. For example the problem of ensuring 
that native code libraries are loaded on every run is handled generically by 
modifying the Runtime class to become an action handler, making the list of 
loaded library names persistent and reloading them on restart. 

A typical idiom for using action handlers is to arrange that constructors of 
classes needing external state re-establishing on resumption register each new 
instance on a list. The code in the action handler then scans these lists re- 
establishing whatever external values are needed. 
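
The registration idiom can be sketched as follows. The text gives the method names of PJActionHandler but not their signatures, so the sketch declares a local stand-in interface, ActionHandler, with no-argument void methods; that interface, the Connection class and the ConnectionHandler class are assumptions for illustration and are not the PJama API.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for PJama's PJActionHandler. The method names come from
// the text; the no-argument void signatures are an assumption.
interface ActionHandler {
    void onStabilization();
    void onStartup();
    void onShutdown();
    void onAbort();
    void onRecoveryStartup();
}

// Hypothetical class whose instances depend on external state.
class Connection {
    // Every instance is recorded here by the constructor.
    static final List<Connection> INSTANCES = new ArrayList<>();

    final String host;       // description of the resource, kept in the store
    transient boolean open;  // external state, lost across a restart

    Connection(String host) {
        this.host = host;
        this.open = true;
        INSTANCES.add(this);
    }
}

// Handler that scans the list and re-establishes external state on start-up.
class ConnectionHandler implements ActionHandler {
    public void onStartup() {
        for (Connection c : Connection.INSTANCES) {
            c.open = true; // stands in for re-opening the real resource
        }
    }
    public void onRecoveryStartup() { onStartup(); }
    public void onStabilization() { }
    public void onShutdown() { }
    public void onAbort() { }
}
```

How a handler is registered with the PJama system is not described in the text, so the sketch leaves that step out.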

In an early version of PJama transient fields were reset to their default value 
after every checkpoint, but restart handlers were not run. This had the disas- 
trous effect of breaking the invariants on the values of transient fields, causing 
applications to fail. In the current version, the values of transient fields are left 
unchanged across a checkpoint. 

Problem 5.1. The model for writing resumable and resilient code that copes with 
external autonomous state needs to be defined, explained and adopted. 

The Misuse of the Transient Modifier Much of the above discussion is 
centered around using the transient modifier as an integral part of the mecha- 
nisms for handling external state. Unfortunately, since JDK 1.1, the transient 
modifier has been given a semantics that supports a particular programming 
idiom that is specific to JOS, and is incompatible with orthogonal persistence. 

In JDK 1.1, transient has been defined operationally as follows: 

Variables marked transient in a class are not written to the output stream by 
a call of defaultWriteObject(). If a class does not define its own writeObject 
method (the usual case), then its fields are written to an object stream during se- 
rialization by a call of defaultWriteObject. Therefore, by the above definition, 
fields marked transient will not be written to the stream. This is consistent 
with the spirit of transient as suggested by the JLS 1.0, and compatible with 
orthogonal persistence. 

However, even if a class does define its own writeObject, it is still possible 
to call the method defaultWriteObject explicitly. Given the above definition, 
this will write out all the fields that are not marked transient. So, imagine a 
class with ten fields, one of which the programmer wants to map explicitly to a 
representation in the persistent state. An economical way to achieve this is to 
mark the field transient, call defaultWriteObject to write out the other nine, 
and then write the transient field explicitly. Unfortunately, the transient field 
is not really transient in the sense of the JLS 1.0 specification, since the field is 
saved, albeit with a specialized representation. In particular, it is not transient 
in the sense of orthogonal persistence. 
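
The idiom looks roughly like the sketch below, which uses only the standard java.io serialization API. The class Temperature, its fields and the scaled-integer encoding are hypothetical and chosen for brevity. The field kelvin is marked transient, yet its value survives serialization because writeObject saves it explicitly in a custom form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class: one field is marked transient only so that it can
// be written in a specialized representation.
class Temperature implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String label;       // written by defaultWriteObject
    private transient double kelvin;  // "transient", yet saved explicitly below

    Temperature(String label, double kelvin) {
        this.label = label;
        this.kelvin = kelvin;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();                      // non-transient fields
        out.writeInt((int) Math.round(kelvin * 100));  // custom compact form
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        kelvin = in.readInt() / 100.0;
    }

    String label() { return label; }
    double kelvin() { return kelvin; }

    // Serializes the object to memory and reads a copy back.
    static Temperature roundTrip(Temperature t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(t);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Temperature) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An implementation that treated kelvin as transient in the OPJ sense would discard a value that the class relies on being saved.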

As an example, consider this extract of the LinkedList class from the JDK 
1.2 collections package: 

    private transient Entry header = new Entry(null, null, null); 
    private transient int size = 0; 

These fields represent the entire state of the object. Although one might expect 
this class to be a perfect example for the convenient automation offered by JOS, 
the programmer has chosen to serialize them explicitly, and is using the idiom to 
do so, even though there are no non-transient fields to write out. Note also that 
the transformation of the list representation is not related to handling external 
state. It is simply an optimization to reduce the storage costs in the persistent 
form, and avoid some of the overheads of default serialization. It is ironic that 
the overheads are caused in part by the elaborate interposition mechanisms such 
as writeObject. 

A consequence of this idiom is that the natural interpretation of transient 
by OPJ must be avoided, otherwise all classes which use the above idiom will 
be invalid on restart after a checkpoint. 

Since the concept of transient is so useful for dealing with external state, 
a replacement is required. In PJama, we have added a markTransient method 
that marks a field transient in the OPJ sense. The irony is that the transient 
modifier directly supports the OPJ principle of persistence independence, which 
is violated by markTransient. Yet in order to use the serialization idiom the 
programmer is already required to violate this principle comprehensively. 

In fact, in JDK 1.2, an alternate mechanism has been provided for 
specifying which fields to make persistent that might eventually free transient for 
use by OPJ. However, it requires more work by the programmer than the above 
idiom and, evidently, is not being adopted even in the core classes. 
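
The text does not name the JDK 1.2 mechanism. Assuming it refers to the serialPersistentFields declaration of the Java Object Serialization specification, a minimal sketch would look as follows; the class Account and its fields are hypothetical. The serializable fields are listed explicitly, so an unlisted field is omitted from the stream without needing the transient modifier.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

// Hypothetical class that names its serializable fields explicitly.
class Account implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only the fields listed here are treated as serializable fields.
    private static final ObjectStreamField[] serialPersistentFields = {
        new ObjectStreamField("owner", String.class)
    };

    private String owner;         // listed, so written to the stream
    private String sessionToken;  // not listed, so not written

    Account(String owner, String sessionToken) {
        this.owner = owner;
        this.sessionToken = sessionToken;
    }

    String owner() { return owner; }
    String sessionToken() { return sessionToken; }

    // Serializes the object to memory and reads a copy back.
    static Account roundTrip(Account a) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(a);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Account) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```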

Problem 5.2. The difference between transient and requiring-special-translation 
needs to be recognised. Proper mechanisms for each of these are required. 



6 Class Evolution 

A consequence of the OPJ programming model is that an object, once created, 
cannot change its class. Since the class includes behavior, a persistent application 
system is intrinsically resistant to change, except by addition of new classes. 
This consistency, of course, is one of the strengths of the approach, but not all 
change can be achieved by class addition. Sometimes, it is necessary to alter the 
definition of a class and, in order to maintain the type consistency of the system, 
to alter all the existing instances to conform to the change. 

There are two basic approaches to achieving class evolution in OPJ. The 
first is to treat it as an external action, one that is described and takes place 
outside the OPJ programming model. The second is to provide it as an internal 
action as part of the OPJ programming model, for example, by extending the 
Java reflection API to permit Class instances to be modified at run-time, as 
in Smalltalk. It is unclear what overall impact such a change would have 
on the Java platform, but, at a minimum, the implementation challenges alone 
would be significant for an OPJ environment that supported very large numbers 
of classes and instances. A more serious concern is maintaining global consistency 
in the face of a sequence of small changes. It is possible that the transaction 
approach can be used to manage a complex evolution. 

Problem 6.1. Evolution based on reflection appears to be necessary if a contin- 
uous service and maintenance are both to be offered. Both the semantics of the 
reflection operations and their implementation require exploration and definition 
in the context of persistent programming languages. 

The PJama experience with class evolution is very limited to date. We have 
recently implemented a comprehensive evolution tool using the external action 
approach. It operates as an off-line activity on a quiescent store. The 
efficiency of the process is fundamentally limited by the implementation of the 
virtual machine and store layer in the current version of PJama. We expect to 
make substantial progress in this area as we begin to use a new store layer. 

Evolution is, of course, not restricted to the classes of the application, but 
also affects the virtual machine itself. Indeed, during the development of the 
PJama prototype, there have been many occasions when a new version of the 
prototype required that the application completely rebuild its persistent store. 
This could be solved either by a specific store migration tool with each release, 
as is common in commercial database products, or by a general archive and 
restore utility that would save the application data in a form independent from 
a particular OPJ implementation. Because the intrinsic classes and other classes 
that use C data structures may be intimately inter-related with JVM details, this 
is not straightforward. There are for example complex bootstrap problems. It is 
desirable to discover either abstract reflection methods that can be implemented 
by such classes or some other well-defined interface to facilitate system evolution 
and platform migration. 

Problem 6.2. Research is needed to discover mechanisms that permit a populated 
persistent store to be detached from one implementation of a persistent language 
and recoupled to a different implementation of that language. 

Evolution is also complicated by the increasing use of Java as an implementation 
language in the persistence layer itself, because this opens the possibility that an 
application might evolve a class in such a way that it either breaks or seriously 
degrades the performance of the system itself. The absence of clear layers in the 
current Java platform exacerbates this problem. 



7 Automatic Storage Management 

One of the potential benefits of orthogonal persistence for Java is the extension of 
automatic storage management to the domain of persistent objects. While there 
are no conceptual difficulties in this task, the engineering involved in developing 
a robust and efficient solution is considerable. The current level of main-memory 
storage management for the Java platform is generally considered to be inad- 
equate, and there continues to be much debate as to the best approach to the 
problem of garbage collection. Experience with garbage collection for large per- 
sistent stores is quite limited and still the subject of research. 

The current PJama prototype retains the main-memory allocation system 
of Sun’s reference JDK implementation, including its garbage collector for the 
transient heap. Allocation in the persistent space is implicitly part of the stabi- 
lization process, but is currently subject to size limitations per stabilization that 
can cause transactions with large numbers of new or modified objects to fail. 
Garbage collection of the persistent store is only available as an off-line process 
at present, which prevents continuous operation. We expect to remove both 
of these limitations when the new persistent store layer is deployed. Currently 
Gemstone/J provides the only commercially available persistence solution for 
Java with a concurrent persistent store garbage collector. 


8 Performance of Orthogonal Persistence 

Orthogonal persistence for the Java language is undoubtedly challenging to im- 
plement such that application performance is acceptable. Indeed, it is unlikely 
to ever compete on a purely CPU-bound application executing in a highly opti- 
mized virtual machine. Fortunately such applications exist only in the simplified 
world of naive benchmarks. 

The only realistic performance measure is total application cost. This includes 
application load time, initialization, and the cost of any layered subsystems, such 
as object-relational mapping layers, and a relational database or file system. 
There are inherent and significant costs in the multiple-process architectures that 
are implied in such solutions. We are in the process of performing a comparative 
study of the total application costs for a range of applications and persistence 
solutions and hope to report on this in the near future. 

Two events have changed the performance overhead of PJama since we last 
reported our experiences. The first is that since JDK 1.1 the standard 
interpreter loop has been rewritten in assembler code. This achieves roughly a 
factor of two speedup. The second is the emergence of Just-In-Time (JIT) com- 
pilers for Java bytecodes. The public release of PJama has not been modified 
to use an assembler interpreter loop. Nor have we felt it worthwhile to support 
a JIT, which would require modification to insert read and write barriers in 
the generated code. Internally, we have measured an assembler-loop version of 
PJama as incurring an overhead of 14% on one benchmark, which is consistent with 
the overhead reported in HS|, but carries the caveat of a singleton experiment. 

It is clear that the overhead of OPJ inevitably increases as the performance of 
the basic engine improves. Work is underway to exploit optimization techniques 
to further reduce the overhead of OPJ and ultimately we expect 
to benefit from the more comprehensive dynamic optimization frameworks such 
as Java Hotspot™. However, we still believe that total application cost is the 
critical measure and that, as with automatic storage management, it is accept- 
able to use a percentage of the annual increase in computing power to simplify 
the development of robust and reliable software. 

Problem 8.1. There are opportunities to exploit orthogonal persistence in op- 
timizers. For example, analysis structures can easily be retained and re-used. 
There are also opportunities to increase persistent language performance by fur- 
ther research into optimizations that reduce barrier insertions, etc. Optimization 
in this context should be validated using total system measurements on appro- 
priately realistic applications. 

9 Distribution 

It should be clear from the earlier discussion that, just as the Java language 
specification does not directly address the issue of distribution, neither does or- 
thogonal persistence for Java. Although the specification permits an implemen- 
tation that maintains sets of objects at different nodes in a network, nothing in 
the Java language specification allows these sets to be distinguished. 
Furthermore, stabilization would be required to be transparent across all the nodes, 
although, of course, it might fail more often than with a single node. This is 
the so-called one-world model of distribution, which is known to be unrealistic 
beyond relatively small and tightly-bound systems of nodes. 

The standard mechanism for distribution in the Java platform is Java RMI, 
which is more akin to the federated model of distribution, where nodes are 
autonomous, and programmers must explicitly deal with communication failure 
and issues of object identity. Java RMI is actually a hybrid system, as it offers 
aspects of the one-world model for objects that implement the java.rmi.Remote 
interface, including distributed garbage collection. However, programmers must 
explicitly handle method invocation failure due to communication problems, and 
some standard methods, for example Object.hashCode, operate on the local 
proxy object and not on the object to which the proxy refers. 
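
The explicit failure handling required of programmers can be sketched with the standard java.rmi types. The interface Counter and the classes FlakyCounter and CounterClient are hypothetical; FlakyCounter simulates a communication failure locally so that the sketch runs without a registry or a network.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Remote interface: every remote method must declare RemoteException.
interface Counter extends Remote {
    int increment() throws RemoteException;
}

// Hypothetical local stand-in that fails on demand, so that the example
// needs neither a registry nor a network connection.
class FlakyCounter implements Counter {
    private int value;
    private final boolean failing;

    FlakyCounter(boolean failing) {
        this.failing = failing;
    }

    public int increment() throws RemoteException {
        if (failing) {
            throw new RemoteException("connection lost");
        }
        return ++value;
    }
}

class CounterClient {
    // Returns the new value, or -1 if the call failed for communication
    // reasons. The caller cannot ignore the failure case: the compiler
    // forces RemoteException to be caught or declared.
    static int tryIncrement(Counter c) {
        try {
            return c.increment();
        } catch (RemoteException e) {
            return -1;
        }
    }
}
```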

Since Java RMI, which is almost completely written in Java, uses the socket 
and input and output mechanisms of the Java core classes, it falls into the 
category of code that depends on external state discussed in section 5. Perhaps 
not surprisingly therefore, when standard Java RMI is used with the PJama 
virtual machine, it does not work correctly once the Java RMI classes are made 
persistent. To remedy this, the PJama system now includes a modified version of 
Java RMI that exploits the PJama action handler framework to work correctly 
when a persistent store is resumed. In addition, it supports the transparent 
persistence of both exported server objects and client stubs. This scheme is very 
easy to use in comparison with the programming-intensive mechanisms that are 
required by the Java RMI Activation framework, which is a new feature of Java 
RMI that provides similar capabilities in the JDK 1.2 release. 

While the support for persistence of Java RMI objects is valuable, it would be 
preferable if a more orthogonal distribution mechanism were available for Java 
programmers. The arguments that support the principles of orthogonal persis- 
tence apply equally to the field of distribution despite the fact that the realities 
of latency and partial failure cannot be ignored completely. An example of the 
benefits that orthogonal persistence could provide to distribution is the provision 
of stable, globally unique, identity-based hashcodes. Longer term, the 
extensible transaction model might be a useful mechanism for handling communication 
failure in addition to handling transaction conflicts and media failure. 



10 Conclusions and Further Work 

While the general principles of orthogonal persistence are well understood, there 
is only limited experience in applying them to specific languages. In applying the 
principles to the Java language, a number of subtle issues arise, which have only 
been resolved through the experience of application development. These experi- 
ences have confirmed the principles to be fundamentally sound and a significant 
contribution to simplifying the otherwise complex task of managing persistent 
state. 

The work on PJama developing orthogonal persistence in the Java platform 
has been interesting because, unlike earlier experiences with academic research 
languages, we were not only presented with a language specification that was 
a fait accompli, but that language's platform has been evolving rapidly. It is 
significant that we have been able to show that the existing language and its 
evolutions can be supported. There are certainly several cases where forethought 
about persistence would have yielded improvement, but for most of the language 
its integration with orthogonal persistence has been straightforward. This could 
be said to vindicate the claim in ^ that orthogonal persistence was now ready 
for industrial use. 

Our experiences have demonstrated that the principles of orthogonal per- 
sistence must be applied uniformly to the entire Java platform, as application 
developers inevitably stumble into any omissions, ironically often as a direct 
consequence of transitive persistence. 

It is perhaps surprising to find that support for concurrent transactions 
within a multithreaded Java application has not become a central issue. We 
remain committed to research into flexible transaction models that can operate 
well in this context, but we have also found that there is a demand for dura- 
bility and recovery where applications have been built using Java’s isolation 
and consistency mechanisms. It is not yet clear how the two approaches can be 
reconciled. 

The open nature of the Java language, which contributed greatly to its speedy 
adoption, requires that a simple solution be found to the problem of handling 
state that is external to the Java environment. A combination of the transient 
modifier and action handlers that are invoked at critical points in the execution 
of an OPJ system solves these problems in an economic and efficient manner. 

Several challenges remain before the promise of the OPJ model can be de- 
ployed routinely in a wide variety of application environments. First, the imple- 
mentation of OPJ must be complete and include support for threads. Second, 
and largely a matter of engineering, the implementation must be scalable and de- 
liver performance that is competitive with standard Java environments. Finally, 
and the most critical barrier to adoption, there must be an effective solution to 
the problem of system evolution. 

The main text has highlighted a number of specific topics which deserve 
further attention. In addition there are three more general areas worthy of in- 
vestigation. 

1. Appropriate handling of bulk types. Earlier work has suggested that 
standard data structures (collection classes and indexes) can represent bulk 
values and that queries can be evaluated by converting them using reflection 
into code appropriate to both the query and the underlying data structure. 
The development of this approach into “industrial-strength” bulk data han- 
dling, say for SQLJ and OQL, has yet to be demonstrated. This will probably 
depend on a suitable polymorphism extension. 

2. Appropriate software engineering approaches. The software engineering 
methodologies that are currently in use are tuned to the two traditional 
data processing models described above (input-compute-output and 
query-compute-update-repeat). More incremental and event-driven models of 
processing become available with orthogonal persistence and it is necessary to 
develop corresponding software engineering methodologies. Until such ap- 
proaches, or at least the underlying design patterns, are made explicit, it 
will be difficult to get application programmers to convert from the tradi- 
tional processing models and to benefit from orthogonal persistence. 

3. Program development and administration systems are needed for persistent 
programming. For example, new build models that include evolution must 
be supported, tools are needed to visualize the content and operation of the 
persistent store, and installation technology is needed to ship and install 
replacement sets of classes and objects in populated production stores. 

Abstract. The realities and opportunities of the global information 
environment enable and necessitate new techniques and approaches; they 
expand the scope and the methodology of database theory. This talk sur- 
veys recent results of this nature, by the author and/or several collabo- 
rators, in two areas. In information retrieval, spectral methods have been 
introduced which extract the hidden semantics of a corpus by analysing 
the eigenvalues of related matrices. In fact, the performance of such meth- 
ods can be theoretically predicted to be favorable in a certain statistical 
sense. A different spectral method has also been introduced successfully 
in the analysis of hypertext so as to identify authoritative sources of in- 
formation. In data mining — the search for interesting patterns in data — 
we argue that a meaningful definition of “interesting” requires consideration 
of the optimization problem the enterprise is facing. This “microeconomic” 
view leads quickly to certain novel and interesting computational 
problems. 

Abstract. Description Logics are logics for representing and reasoning 
about classes of objects and their relationships. They can be seen as 
successors of semantic networks and frame systems, and have been inves- 
tigated for more than a decade under different points of view, in particu- 
lar, expressive power and computational complexity of reasoning. In this 
short paper, we introduce Description Logics, we compare Description 
Logics with Database models, and then discuss how Description Logics 
can be used for several tasks related to data management, in particular 
information integration, and semi-structured data modeling. 



1 Introduction 



The idea of developing knowledge representation systems based on a struc- 
tured representation of knowledge was first pursued with Semantic Networks 
and Frames. Semantic Networks represent knowledge in the form of a 
labeled directed graph. Specifically, each node is associated with a concept, and 
the arcs represent the various relations between concepts. Frames represent 
concepts (or classes) and are characterized by a number of elements called slots, 
each of which corresponds to an attribute that the members of the class can 
have. 

The attempt to provide a formal ground to Semantic Networks and Frames 
led to the development of the system KL-ONE. More recently, Description 
Logics¹ (DLs) have been proposed as successors of KL-ONE, with an explicit 
model-theoretic semantics. The family of DL systems includes systems like 
KRYPTON, NIKL, BACK, LOOM, CLASSIC, KRIS, and others (see 



In DLs the domain of interest is modeled by means of concepts and rela- 
tionships, which denote classes of objects and relations, respectively. Generally 
speaking, a DL is formed by three basic components: 



— A description language, which specifies how to construct complex concept 
and relationship expressions (also called simply concepts and relationships), 

¹ See http://dl.kr.org for the home page of Description Logics. 
by starting from a set of atomic symbols and by applying suitable 
constructors. 

— A knowledge specification mechanism, which specifies how to construct a 
DL knowledge base, in which properties of concepts and relationships are 
asserted. 

— A set of reasoning procedures provided by the DL. 

The set of allowed constructors characterizes the expressive power of the 
description language. Various languages have been considered by the DL 
community, and recent papers aim at formally studying their expressive power under 
different points of view. 

We provide here an example of a description language, namely that of 
the DL $\mathcal{DLR}_{reg}$. We assume that we deal with a finite set of atomic 
relationships and concepts², denoted by $P$ and $A$ respectively. We use $R$ to denote 
arbitrary relations³ (of given arity between 2 and $n_{max}$), $E$ to denote regular 
expressions, and $C$ to denote arbitrary concepts, respectively built according to 
the following syntax: 

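\[
\begin{array}{rcl}
R & ::= & \top_n \mid P \mid (\$i/n : C) \mid \neg R \mid R_1 \sqcap R_2 \\
E & ::= & \varepsilon \mid R|_{\$i,\$j} \mid E_1 \circ E_2 \mid E_1 \sqcup E_2 \mid E^* \\
C & ::= & \top_1 \mid A \mid \neg C \mid C_1 \sqcap C_2 \mid \exists E.C \mid \exists[\$i]R \mid (\leq k\,[\$i]R)
\end{array}
\]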

where $i$ and $j$ denote components of relations, i.e. integers between 1 and $n_{max}$, 
$n$ denotes the arity of a relation, i.e. an integer between 2 and $n_{max}$, and $k$ 
denotes a nonnegative integer. 

We consider only concepts and relationships that are well-typed, which means 
that 

— only relations of the same arity $n$ are combined to form expressions of type 
$R_1 \sqcap R_2$ (which inherit the arity $n$), and 

— $i \leq n$ whenever $i$ denotes a component of a relation of arity $n$. 

The semantics of expressions is specified through the notion of interpretation. 
An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ is constituted by an interpretation domain 
$\Delta^{\mathcal{I}}$ and an interpretation function $\cdot^{\mathcal{I}}$ that assigns to each concept $C$ a subset 
$C^{\mathcal{I}}$ of $\Delta^{\mathcal{I}}$, to each regular expression $E$ a subset $E^{\mathcal{I}}$ of 
$\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$, and to each relation $R$ of arity $n$ a subset $R^{\mathcal{I}}$ of 
$(\Delta^{\mathcal{I}})^n$, such that the conditions in Figure Q] are 
satisfied. We observe that $\top_1$ denotes the interpretation domain, while $\top_n$, for 
$n > 1$, does not denote the $n$-Cartesian product of the domain, but only a subset 
of it, that covers all relations of arity $n$. It follows, from this property, that the 
“$\neg$” constructor on relations is used to express difference of relations, rather 
than complement. 

² In general, DLs consider also individuals besides concepts and relationships, although 
we do not deal with them in this paper. 

³ Although most DLs deal only with binary relationships (called roles), $\mathcal{DLR}_{reg}$ does 
not impose this limitation. 

Using (concept and relationship) expressions, knowledge about concepts and 
relationships can be expressed through the notion of knowledge base. In $\mathcal{DLR}_{reg}$, 
a knowledge base is constituted by a finite set of inclusion assertions of the form 

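\[ R_1 \sqsubseteq R_2 \qquad C_1 \sqsubseteq C_2 \] 

where $R_1$ and $R_2$ are of the same arity. 

An interpretation $\mathcal{I}$ satisfies an assertion $R_1 \sqsubseteq R_2$ (resp. $C_1 \sqsubseteq C_2$) if 
$R_1^{\mathcal{I}} \subseteq R_2^{\mathcal{I}}$ (resp. $C_1^{\mathcal{I}} \subseteq C_2^{\mathcal{I}}$). An interpretation that satisfies all assertions in a knowledge 
base $S$ is called a model of $S$. 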
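
For a finite interpretation, the definition of a model can be checked mechanically. The following minimal sketch is restricted to inclusion assertions between atomic concepts, which is an assumption made for brevity; the class Interpretation, its methods and the concept names in the usage note are hypothetical and do not come from any DL system.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Finite interpretation restricted to atomic concepts (an assumption).
class Interpretation {
    // Maps each atomic concept name to its extension in the domain.
    private final Map<String, Set<String>> extension = new HashMap<>();

    Interpretation assign(String concept, String... objects) {
        extension.put(concept, new HashSet<>(Arrays.asList(objects)));
        return this;
    }

    // The assertion "c1 is included in c2" is satisfied iff the
    // extension of c1 is a subset of the extension of c2.
    boolean satisfies(String c1, String c2) {
        Set<String> left =
            extension.getOrDefault(c1, Collections.<String>emptySet());
        Set<String> right =
            extension.getOrDefault(c2, Collections.<String>emptySet());
        return right.containsAll(left);
    }

    // The interpretation is a model of a knowledge base iff it satisfies
    // every assertion; each assertion is a pair {c1, c2}.
    boolean isModelOf(List<String[]> assertions) {
        for (String[] a : assertions) {
            if (!satisfies(a[0], a[1])) {
                return false;
            }
        }
        return true;
    }
}
```

With Student interpreted as {ann} and Person as {ann, bob}, the interpretation is a model of the knowledge base containing only the inclusion of Student in Person, but not of one that also contains the converse inclusion.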

The fundamental reasoning tasks considered in the context of Description
Logics are knowledge base satisfiability, concept and relationship satisfiability,
and logical implication. A knowledge base S is satisfiable if it admits a model.
A concept C is satisfiable in S if S admits a model in which C has a nonempty
interpretation (similarly for relationships). S logically implies an inclusion
assertion $C_1 \sqsubseteq C_2$ if in all models of S the interpretation of $C_1$ is a subset of the
interpretation of $C_2$.

Much of the research effort in DLs in the last decade has been devoted
to characterizing the trade-off between the expressive power of DLs and the
decidability/complexity of reasoning. We refer to [24,23] for a survey on this
subject.
2 Description Logics and Database Models

A knowledge base S in $\mathcal{DLR}_{reg}$ can be seen as a database schema, in such a
way that a model of S corresponds to a database conforming to the schema, i.e.
a database satisfying all the constraints represented by S. Under this view, a
DL can be considered as a data model. An inclusion assertion $A \sqsubseteq C$ (where A
is atomic) specifies necessary conditions for an object to be an instance of the
concept A, and thus corresponds naturally to the constraints imposed on classes
by a schema expressed in a traditional database model. The pair of inclusions
$A \sqsubseteq C$ and $C \sqsubseteq A$ specifies both necessary and sufficient conditions for the
instances of A, and thus corresponds to the concept of view used in databases.

Several papers discuss more precisely the relationship between DLs and data
models and languages (see, for example, […]). All of them emphasize that
DLs provide modeling features, such as the possibility of expressing
incomplete information, that are generally not supported by traditional data models.
Notably, the reasoning capabilities of DLs can be very useful for supporting
inferencing on a database schema. In this respect, it is important to notice that
reasoning in databases is usually done by referring to finite models only. The
assumption of dealing with finite structures is, however, by no means common
in Description Logics, and needs to be taken explicitly into account when
devising reasoning procedures to be used in data modeling. We refer to […] for
methods for finite model reasoning in DLs.

3 Information Integration 

Several advanced applications of databases require integrating information
coming from different sources […], so as to provide uniform access to data
residing at the sources. DLs have been used for information integration in various
ways. For example, […] use DLs for specifying and reasoning about the
inter-relationships between classes of objects in different sources. […] use a specific
DL for describing the conceptual schema of an integration application, and for
specifying the content of the sources, as views of the conceptual schema.

In all the approaches, the reasoning capabilities of DLs are used to support 
the task of query answering. In particular, the notions of query containment and 
query rewriting have been shown to be crucial for devising effective algorithms 
that reason about the content of the sources and their relationships with the 
query. Recent results on these aspects for the case of DL-based integration are
reported in […].

4 Semi-structured Data Modeling 

The ability to represent data whose structure is less rigid and strict than in
conventional databases is considered a crucial aspect in many application areas,
such as digital libraries, internet information systems, etc. Following […], we
define semi-structured data as data that is neither raw, nor strictly typed as in
conventional database systems. Recent proposals of models for semi-structured
data represent data as graphs with labeled edges, where information on both
the values and the schema of data is kept […]. For several tasks related
to data management, it is important to be able to check subsumption between
two schemas, i.e. to check whether every graph conforming to one schema
also conforms to another schema.

A model of a DL knowledge base can be naturally seen as a labeled graph. 
Moreover, the possibility of expressing both incomplete information and sophisti- 
cated constraints at the schema level makes the family of DLs a good candidate 
for modeling semi-structured data schemas. Again, the reasoning capabilities
of DLs can be profitably exploited for reasoning about semi-structured data
schemas and queries, as discussed in […].
Abstract. We consider the class of path-conjunctive queries and
constraints (dependencies) defined over complex values with dictionaries.
This class includes the relational conjunctive queries and embedded
dependencies, as well as many interesting examples of complex value and
oodb queries and integrity constraints. We show that some important
classical results on containment, dependency implication, and chasing
extend and generalize to this class.



1 Motivation 



We are interested in distributed, mediator-based systems […] with multiple
layers of nodes implementing mediated views (unmaterialized or only partially
materialized) that integrate heterogeneous data sources. Most of the queries that
flow between the nodes of such a system are not formulated by a user but are
instead generated automatically by composition with views and decomposition
among multiple sources. Unoptimized, this process causes queries to quickly
snowball into forms with superfluous computations. Even more importantly,
without additional optimizations this process ignores intra- and inter-data source
integrity constraints.

It has been recognized for some time that exploiting integrity constraints
(so-called semantic optimization) plays a crucial role in oodbs and in
integrating heterogeneous data sources […]. Relational database theory
has studied such issues extensively […], but in the recent
literature the use of constraints in optimization has been limited to special cases
(see the related work section). In contrast, we propose here a foundation for
a systematic and quite general approach encompassing the old relational theory
and a significant class of non-relational queries, constraints and views. A novel
property of our approach, one that we hope to exploit in relation to rule-based
optimization as in [CZ96], is that optimizing under constraints or deriving other
constraints can be done within the equational theory of our internal framework
by rewriting with the constraints themselves.



^ In this paper, we use the terms constraints and dependencies interchangeably. 
In this opening section we show two motivating examples: one illustrating
the important effect of constraints on query optimization, the other,
coming from the announced paradigm of distributed mediator-based systems,
concerned with deriving constraints that hold in views. The novel ideas to look
for in these examples are (1) the use of the constraints themselves as “rewrite”
rules (what we call the equational chase), (2) the ability to “compose” views with
constraints, yielding other constraints that can be checked directly, and (3) the ease
with which we handle the heterogeneous nature of the examples’ schema (figure 1),
which features a relation interacting with a class.

Both examples are resolved with the general techniques that we develop
in this paper. For simplicity these examples are shown in the syntax of the
ODMG [Cat96] proposal, although the actual method is developed and justified
primarily for our more general internal framework (see section 3). Consider the
schema in figure 1. It is written following mostly the syntax of ODL, the data
definition language of ODMG, extended with referential integrity (foreign key)
constraints in the style of data definition in SQL. It consists of a class Dept
whose objects represent departments, with name, manager name, and DProjs,
the set of names of all the projects done in the department. It also consists of a
relation Proj whose tuples represent projects, with name, customer name, and
PDept, the name of the department in which the project is done.



Proj : set<struct{
    string PName;
    string CustName;
    string PDept; }>
  primary key PName;
  foreign key PDept
    references Dept::DName;
  relationship PDept
    inverse Dept::DProjs;

class Dept
  (extent depts key DName) {
    attribute string DName;
    relationship set<string> DProjs
      inverse Proj(PDept);
    attribute string MgrName; }
  foreign key DProjs
    references Proj(PName);

Fig. 1. The Proj-Dept schema, expressed in extended ODMG



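The meaning of the referential integrity constraints (RICs), inverse relationship, and
key constraints specified in the schema is expressed by the following logical
statements.

(RIC1) $\forall (d \in \mathtt{depts})\ \forall (s \in d.\mathtt{DProjs})\ \exists (p \in \mathtt{Proj})\ \ s = p.\mathtt{PName}$

(RIC2) $\forall (p \in \mathtt{Proj})\ \exists (d \in \mathtt{depts})\ \ p.\mathtt{PDept} = d.\mathtt{DName}$

(INV1) $\forall (d \in \mathtt{depts})\ \forall (s \in d.\mathtt{DProjs})\ \forall (p \in \mathtt{Proj})\ \ s = p.\mathtt{PName} \Rightarrow p.\mathtt{PDept} = d.\mathtt{DName}$

(INV2) $\forall (p \in \mathtt{Proj})\ \forall (d \in \mathtt{depts})\ \ p.\mathtt{PDept} = d.\mathtt{DName} \Rightarrow \exists (s \in d.\mathtt{DProjs})\ p.\mathtt{PName} = s$

(KEY1) $\forall (d \in \mathtt{depts})\ \forall (d' \in \mathtt{depts})\ \ d.\mathtt{DName} = d'.\mathtt{DName} \Rightarrow d = d'$

(KEY2) $\forall (p \in \mathtt{Proj})\ \forall (p' \in \mathtt{Proj})\ \ p.\mathtt{PName} = p'.\mathtt{PName} \Rightarrow p = p'$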

A query equivalence. In the presence of the constraints above, the following
OQL query:

define ABC_DP as select distinct struct(PN: s, DN: d.DName)
from depts d, d.DProjs s, Proj p
where s = p.PName and p.CustName = "ABC"

is equivalent to the (likely better) query:

define ABC_P as select distinct struct(PN: p.PName, DN: p.PDept)
from Proj p
where p.CustName = "ABC"

Because it involves both a class and a relation, and because it doesn’t hold
if the constraints are absent, the equivalence between ABC_DP and ABC_P does
not appear to fit within previously proposed optimization frameworks. We will
see that the equational chase method that we develop in this paper can be
used to obtain this equivalence. Each of the constraints determines a
semantics-preserving transformation rule. Specifically, one step of chasing with a constraint
of the form

$\forall (r_1 \in R_1) \cdots \forall (r_m \in R_m)\ [\, B_1 \Rightarrow \exists (s_1 \in S_1) \cdots \exists (s_n \in S_n)\ B_2 \,]$

produces the transformation

select distinct E
from $R_1\ r_1, \ldots, R_m\ r_m$
where $B_1$ and C

$\longmapsto$

select distinct E
from $R_1\ r_1, \ldots, R_m\ r_m, S_1\ s_1, \ldots, S_n\ s_n$
where $B_1$ and $B_2$ and C

This transformation may seem somewhat mysterious, but we shall see that it
can be justified by equational rewriting with the constraint itself put in an
equational form and with the (idemloop) law (see section 3). We shall also see that
when restricted to the relational model, the method amounts exactly to the
chase [BV84]. Chasing ABC_DP with (INV1) gives

select distinct struct(PN: s, DN: d.DName)
from depts d, d.DProjs s, Proj p
where s = p.PName and p.PDept = d.DName and p.CustName = "ABC"

while chasing ABC_P with (RIC2) and then with (INV2) gives

select distinct struct(PN: p.PName, DN: p.PDept)
from Proj p, depts d, d.DProjs s
where p.PDept = d.DName and p.PName = s and p.CustName = "ABC"

These two queries are clearly equivalent.
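
The equivalence can be tried out on concrete data. The following is only a sketch under stated assumptions: the two OQL queries are transcribed by hand into Python set comprehensions, departments and projects are plain dictionaries, and the sample instances (`depts_ok`, `proj_ok`, `proj_bad`) are invented for illustration. It tests single instances and proves nothing about all instances.

```python
# Sketch: ABC_DP and ABC_P evaluated over small hand-made instances.

def abc_dp(depts, proj):
    """select distinct struct(PN: s, DN: d.DName)
       from depts d, d.DProjs s, Proj p
       where s = p.PName and p.CustName = "ABC" """
    return {(s, d["DName"])
            for d in depts for s in d["DProjs"] for p in proj
            if s == p["PName"] and p["CustName"] == "ABC"}

def abc_p(proj):
    """select distinct struct(PN: p.PName, DN: p.PDept)
       from Proj p where p.CustName = "ABC" """
    return {(p["PName"], p["PDept"]) for p in proj if p["CustName"] == "ABC"}

# An instance satisfying the RIC, INV and KEY constraints of the schema.
depts_ok = [
    {"DName": "R&D", "MgrName": "Ann", "DProjs": {"p1", "p2"}},
    {"DName": "Ops", "MgrName": "Bob", "DProjs": {"p3"}},
]
proj_ok = [
    {"PName": "p1", "CustName": "ABC", "PDept": "R&D"},
    {"PName": "p2", "CustName": "XYZ", "PDept": "R&D"},
    {"PName": "p3", "CustName": "ABC", "PDept": "Ops"},
]

# Violates (INV2): p4 is done in Ops, but Ops does not list p4 in DProjs.
proj_bad = proj_ok + [{"PName": "p4", "CustName": "ABC", "PDept": "Ops"}]
```

On the constraint-satisfying instance both functions return the same set; on the instance that violates (INV2), `abc_p` returns the extra pair for `p4`, which shows that the equivalence depends on the constraints.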

^ OQL is ODMG’s query language. 
A constraint derivation for views. Consider again the Proj-Dept schema
in figure 1 and the following view, which combines unnesting, joining, and
projecting:

define UJP_View as
select distinct
    struct(PN: p.PName, CN: p.CustName, MN: d.MgrName)
from Proj p, depts d, d.DProjs s
where s = p.PName

We will establish that the functional dependency PN → MN holds in this view,
namely

(FD) $\forall (v \in \mathtt{UJP\_View})\ \forall (v' \in \mathtt{UJP\_View})\ \ v.\mathtt{PN} = v'.\mathtt{PN} \Rightarrow v.\mathtt{MN} = v'.\mathtt{MN}$

Note that standard reasoning about functional dependencies is not applicable 
because of the nature of the view. Indeed we shall see that in addition to the 
key dependencies we will also use (INVl) to establish (FD). 

We perform the derivation in two phases. First we compose (FD) with the
view, obtaining a constraint on the original Proj-Dept schema. This is done by
substituting the definition of UJP_View for the name UJP_View in (FD) and then
rewriting the resulting constraint to the following form:

(FD-RETRO) $\forall (p \in \mathtt{Proj})\ \forall (d \in \mathtt{depts})\ \forall (s \in d.\mathtt{DProjs})$
$\forall (p' \in \mathtt{Proj})\ \forall (d' \in \mathtt{depts})\ \forall (s' \in d'.\mathtt{DProjs})$
$s = p.\mathtt{PName} \wedge s' = p'.\mathtt{PName} \wedge p.\mathtt{PName} = p'.\mathtt{PName}$
$\Rightarrow d.\mathtt{MgrName} = d'.\mathtt{MgrName}$

Thus, (FD) holds in the view iff (FD-RETRO) holds in Proj-Dept. In the second
phase we show that (FD-RETRO) follows from the constraints in the Proj-Dept
schema. The equational chase method applies to constraints as well as to queries.
Specifically, one step of chasing with a constraint of the form

$\forall (r_1 \in R_1) \cdots \forall (r_m \in R_m)\ [\, B_1 \Rightarrow \exists (s_1 \in S_1) \cdots \exists (s_n \in S_n)\ B_2 \,]$

transforms a constraint of the form

$\forall (r_1 \in R_1) \cdots \forall (r_m \in R_m)\ [\, B_1 \Rightarrow \exists (t_1 \in T_1) \cdots \exists (t_p \in T_p)\ B_3 \,]$

into

$\forall (r_1 \in R_1) \cdots \forall (r_m \in R_m)\ \forall (s_1 \in S_1) \cdots \forall (s_n \in S_n)$
$[\, B_1 \wedge B_2 \Rightarrow \exists (t_1 \in T_1) \cdots \exists (t_p \in T_p)\ B_3 \,]$

Again, the rewriting that produces (FD-RETRO) and the transformation of
constraints we just saw may seem somewhat mysterious, but they too have a
simple and clear justification in terms of equational rewriting (see section 3).
Now, chasing (twice) with (INV1), then with (KEY2) and finally with (KEY1),
transforms (FD-RETRO) into

(TRIV) $\forall (p \in \mathtt{Proj})\ \forall (d \in \mathtt{depts})\ \forall (s \in d.\mathtt{DProjs})$
$\forall (p' \in \mathtt{Proj})\ \forall (d' \in \mathtt{depts})\ \forall (s' \in d'.\mathtt{DProjs})$
$s = p.\mathtt{PName} \wedge p.\mathtt{PDept} = d.\mathtt{DName} \wedge$
$s' = p'.\mathtt{PName} \wedge p'.\mathtt{PDept} = d'.\mathtt{DName} \wedge$
$p.\mathtt{PName} = p'.\mathtt{PName} \wedge p = p' \wedge$
$d = d' \Rightarrow d.\mathtt{MgrName} = d'.\mathtt{MgrName}$

(Along the way, some simple facts about conjunction and equality came into
play.) But (TRIV) is a constraint that holds in all instances, regardless of
other constraints (such a constraint or dependency is traditionally called trivial).
Therefore, (FD-RETRO) follows from (INV1), (KEY2) and (KEY1).
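
As a sanity check of the derived dependency, the view and (FD) can be evaluated on sample data. This is a minimal sketch, assuming the same dictionary encoding of departments and projects as plain Python values; the instances are hypothetical, and a test on one instance is of course no substitute for the derivation above.

```python
# Sketch: the view UJP_View and the functional dependency PN -> MN.

def ujp_view(depts, proj):
    """Triples (PN, CN, MN) obtained by unnesting DProjs and joining on PName."""
    return {(p["PName"], p["CustName"], d["MgrName"])
            for p in proj for d in depts for s in d["DProjs"]
            if s == p["PName"]}

def fd_pn_mn(view):
    """(FD): any two view tuples that agree on PN also agree on MN."""
    return all(v[2] == w[2] for v in view for w in view if v[0] == w[0])

proj = [
    {"PName": "p1", "CustName": "ABC", "PDept": "R&D"},
    {"PName": "p3", "CustName": "ABC", "PDept": "Ops"},
]

# Satisfies (INV1): each listed project is done in the listing department.
depts_ok = [
    {"DName": "R&D", "MgrName": "Ann", "DProjs": {"p1"}},
    {"DName": "Ops", "MgrName": "Bob", "DProjs": {"p3"}},
]

# Violates (INV1): Ops also lists p1, although p1.PDept is "R&D".
depts_bad = [
    {"DName": "R&D", "MgrName": "Ann", "DProjs": {"p1"}},
    {"DName": "Ops", "MgrName": "Bob", "DProjs": {"p1", "p3"}},
]
```

With `depts_ok` the dependency holds in the view; with `depts_bad` project `p1` appears with two different managers, which matches the observation that (INV1) is needed to establish (FD).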

How general is this method? The rest of the paper is concerned with proposing
an answer to this question. We first give an overview, then we use our internal
framework to justify the soundness of the method illustrated in the previous
examples, and finally, using path-conjunctions, we identify a class of queries,
constraints and views for which this method leads to decision procedures and
complete equational reasoning.



2 Overview (of the Rest of the Paper) 



In section 3 we present some aspects of our internal framework, called CoDi.
This is a language and equational theory that combines a treatment of
dictionaries with our previous work [BBW92, BNTW95, LT97] on collections and
aggregates using the theory of monads. In addition we use dictionaries (finite
functions), which allow us to represent oodb schemas and queries. While here we focus
on set-related queries, we show elsewhere that the (full) CoDi collection,
aggregation and dictionary primitives suffice for implementing the quasi-totality of
ODMG/ODL/OQL [Cat96]. Using boolean aggregates, CoDi can represent
constraints like the ones we used in the examples in section 1 as equalities between
boolean-valued queries. We also give the basic equational laws of CoDi, while the
rest of the axiomatization can be found in [PT98]. In section 3 we also show
that the transformations shown in section 1 correspond to precise rewrites
in CoDi. This suggests the definition of a general notion of chase by rewriting,
which connects dependencies to query equivalence (and therefore containment
via intersection). This generalizes the relational chase of [ABU79, MMS79, BV84]
for conjunctive queries (with equality) [CM77, ASU79] and embedded
dependencies [Fag82]. In the same section we discuss composing dependencies with views.
Our main results are in sections 4 and 5. In section 4 we exhibit a class of queries



® From “Collections and Dictionaries” 

^ “An OQL interface to the K2 system”, by J. Crabtree, S. Marker, and V. Tannen, 
forthcoming. 
and dependencies on complex values with dictionaries called path-conjunctive
(PC queries and embedded PC dependencies (EPCDs)) for which the methods
illustrated in earlier sections are complete, and in certain cases decidable.
Theorem 1 in section 4 extends and generalizes the containment decidability/NP
result of [CM77]. The theorem and corollary on the chase in section 5 extend and generalize
the corresponding results on the chase in [BV84]. These theorems
also state that CoDi’s equational axiomatization is complete for deriving
dependencies and containments, which extends and generalizes the corresponding
completeness results on algebraic dependencies from [YP82, Abi83], although in a
different equational theory. Proposition 1, which in our framework is almost “for
free”, immediately implies an extension and generalization of the corresponding
SPC algebra [AHV95] result of [KP82] (but not the SPCU algebra result).

Due to space restrictions, the proofs of the theorems stated here and
some technical definitions are in the companion technical report [PT98]. This
report also contains more results (which we summarize in a later section) and more
examples. Related work and further investigations are also discussed in later sections.



3 Chase and View Composition in CoDi 

We want to “rewrite” queries using constraints and we want to “compose” views
with constraints. But constraints are logical assertions and queries and views
are functional expressions (at least when we use OQL, SQL, or various
algebras). The reconciliation could be attempted within first-order logic using
appropriate translations of queries and views. Since we deal with complex values
(nested sets and records) and with OO classes, some kind of “flattening” encoding
(e.g. [LS97]) would be necessary. Here we take the opposite approach, by
translating constraints into a functional algebra, CoDi, where queries and views are
comfortably represented. We shall see that the fragment of logic that we need
for constraints also has a natural representation.

The three major constructs in our functional algebra are the following (given
here with their semantics, for $S = \{a_1, \ldots, a_n\}$):

$\mathsf{BigU}\ (x \in S)\ R(x) \;\stackrel{\mathrm{def}}{=}\; \bigcup_{i=1}^{n} R(a_i)$

$\mathsf{All}\ (x \in S)\ B(x) \;\stackrel{\mathrm{def}}{=}\; \bigwedge_{i=1}^{n} B(a_i)$

$\mathsf{Some}\ (x \in S)\ B(x) \;\stackrel{\mathrm{def}}{=}\; \bigvee_{i=1}^{n} B(a_i)$

R(x) and S are set expressions, i.e. with types of the form $\{\sigma\}$, and B(x) is a
boolean expression. The variable x is bound in these constructs (as in lambda
abstraction) and we represent by R(x) and B(x) the fact that it may occur in
R and B. B(a), R(a) are the results of substituting a for x. We shall use the
generic notation Loop to stand for BigU, All, or Some, and we shall consider
mostly expressions of the form

$\mathsf{Loop}\ (\overline{x} \in \overline{S})\ \mathsf{if}\ B(\overline{x})\ \mathsf{then}\ E(\overline{x})$

which is an abbreviation for

$\mathsf{Loop}\ (x_1 \in S_1) \cdots \mathsf{Loop}\ (x_n \in S_n(x_1, \ldots, x_{n-1}))$
$\mathsf{if}\ B(x_1, \ldots, x_n)\ \mathsf{then}\ E(x_1, \ldots, x_n)\ \mathsf{else}\ \mathsf{null}$

where null is another generic notation, standing for empty in a BigU, for true
in an All and for false in a Some. We shall also use sng(E), denoting a singleton
set. The syntax for tuples (records) is standard. We also use an equality test,
$\mathsf{eq}(E_1, E_2)$, and boolean conjunction. As for expressive power (so far), note that
BigU is the operation ext of [BNTW95], shown there to have (with singleton
and primitives for tuples and booleans) the expressive power of the relational
algebra over flat relations and the expressive power of the nested relational algebra
over complex objects.
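
The semantics of the three constructs over finite sets can be mirrored directly in code. The following is a minimal sketch, assuming that sets are Python sets and that the bodies R(x) and B(x) are passed as functions; the function names `big_u`, `all_` and `some` are chosen here for illustration and are not part of CoDi.

```python
# Sketch: the three loop constructs over a finite set S = {a_1, ..., a_n}.

def big_u(s, r):
    """BigU (x in S) R(x): the union of the sets R(a_i)."""
    result = set()
    for x in s:
        result |= r(x)
    return result

def all_(s, b):
    """All (x in S) B(x): the conjunction of the B(a_i)."""
    return all(b(x) for x in s)

def some(s, b):
    """Some (x in S) B(x): the disjunction of the B(a_i)."""
    return any(b(x) for x in s)

def sng(e):
    """sng(E): the singleton set containing E."""
    return {e}

# A conditional loop "BigU (x in S) if B(x) then sng(E(x))" uses the
# empty set as the implicit else-branch (the "null" of BigU).
evens_doubled = big_u({1, 2, 3, 4},
                      lambda x: sng(2 * x) if x % 2 == 0 else set())
```

Over the empty set the three loops return their respective null values: the empty set, true, and false.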

Dictionaries. We denote by $\sigma \times\!\!> \tau$ the type of dictionaries (finite functions)
with keys of type $\sigma$ and entries of type $\tau$. dom M denotes the set of keys (the
domain) of the dictionary M. K ! M denotes the entry of M corresponding to
the key K (lookup). This operation fails unless K is in dom M and we will
take care to use it in contexts in which it is guaranteed not to fail. We model
OO classes with extents using dictionaries whose keys are the oids. The extents
become the domains of the dictionaries and the implicit oid dereferencing in
OQL is translated by a lookup in the dictionary. For example, depts d and
d.DProjs s in ABC_DP (section 1) become d ∈ dom Dept and s ∈ d ! Dept.DProjs,
see below.

Example. Referring to the schema in figure 1, the type specs translate in
CoDi as

Proj : {(PName : string, CustName : string, PDept : string)}

Dept : Doid $\times\!\!>$ (DName : string, DProjs : {string}, MgrName : string)

the query ABC_DP translates in CoDi as the expression

BigU (d ∈ dom Dept) BigU (s ∈ d ! Dept.DProjs) BigU (p ∈ Proj)
    if eq(s, p.PName) and eq(p.CustName, "ABC") then sng(PN : s, DN : d ! Dept.DName)

and the constraint (INV2) translates in CoDi as the equation

All (p ∈ Proj) All (d ∈ dom Dept)
    if eq(p.PDept, d ! Dept.DName) then Some (s ∈ d ! Dept.DProjs) eq(p.PName, s)
= true
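
The dictionary encoding of a class can be imitated with an ordinary Python dictionary. The following is only a sketch, assuming oids are strings and records are dictionaries; the oids `"o1"`, `"o2"` and the sample data are hypothetical. Lookup `d ! Dept` is written `dept[d]`, and `dom Dept` is `dept.keys()`.

```python
# Sketch: the class Dept as a dictionary from oids to records.
dept = {
    "o1": {"DName": "R&D", "DProjs": {"p1", "p2"}, "MgrName": "Ann"},
    "o2": {"DName": "Ops", "DProjs": {"p3"}, "MgrName": "Bob"},
}
proj = [
    {"PName": "p1", "CustName": "ABC", "PDept": "R&D"},
    {"PName": "p2", "CustName": "XYZ", "PDept": "R&D"},
    {"PName": "p3", "CustName": "ABC", "PDept": "Ops"},
]

def abc_dp(dept, proj):
    """BigU (d in dom Dept) BigU (s in d ! Dept.DProjs) BigU (p in Proj)
       if eq(s, p.PName) and eq(p.CustName, "ABC")
       then sng(PN : s, DN : d ! Dept.DName)"""
    return {(s, dept[d]["DName"])
            for d in dept.keys()            # d ranges over dom Dept
            for s in dept[d]["DProjs"]      # lookup, safe since d is a key
            for p in proj
            if s == p["PName"] and p["CustName"] == "ABC"}

def inv2(dept, proj):
    """All (p in Proj) All (d in dom Dept)
       if eq(p.PDept, d ! Dept.DName)
       then Some (s in d ! Dept.DProjs) eq(p.PName, s)"""
    return all(any(p["PName"] == s for s in dept[d]["DProjs"])
               for p in proj for d in dept
               if p["PDept"] == dept[d]["DName"])
```

Because `d` always ranges over the keys of `dept`, the lookup `dept[d]` cannot fail, which mirrors the care taken in the text to use lookup only where it is guaranteed to succeed.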

® These generic notations are not ad-hoc. We show in [LT97] that they correspond to
monad algebras which here are structures associated with the set monad and are
“enriched” with a nullary operation.
Expressing constraints this way will allow us to achieve both our
objectives: rewriting queries (and constraints!) with constraints and composing views
with constraints. All our manipulations are justified by CoDi’s equivalence laws
and we show the basic ones in figure 2. Some of these laws are derived from
the theory of monads and monad algebras and the extensions we worked out
in [BNTW95, LT97]. The important new axiom is (idemloop). Its validity is based
on the fact that x does not occur in E and that union, conjunction, and
disjunction are all idempotent operations. The rest of the equational axiomatization is
in [PT98]. Some laws, such as (commute) and (from [PT98]) the laws governing
eq, the conditional, and conjunction, and the congruence laws are used rather routinely to
rearrange expressions in preparation for a more substantial rewrite step. When
describing rewritings we will often omit mentioning these ubiquitous
rearrangements.



(sng)        BigU (x ∈ S) sng(x) = S

(monad-β)    Loop (x ∈ sng(E)) E'(x) = E'(E)

(assoc)      Loop (x ∈ (BigU (y ∈ R) S(y))) E(x) = Loop (y ∈ R) (Loop (x ∈ S(y)) E(x))

(null)       Loop (x ∈ empty) E(x) = null

(cond-loop)  Loop (x ∈ S) if B then E(x) = if B then Loop (x ∈ S) E(x)

(commute)    Loop (x ∈ R) Loop (y ∈ S) E(x, y) = Loop (y ∈ S) Loop (x ∈ R) E(x, y)

(idemloop)   Loop (x ∈ S) if B(x) then E = if Some (x ∈ S) B(x) then E

Fig. 2. Equivalence laws
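
The role of idempotence in (idemloop) can be checked on small inputs. The following is a sketch under stated assumptions: each Loop is modeled as a fold with a combining operation and a null value, `E` is a value that does not depend on `x`, and the helper names are invented here. The last two functions show that the same law fails for a non-idempotent operation such as addition.

```python
from functools import reduce

# Each loop kind: (combining operation, null value).
LOOPS = {
    "BigU": (lambda a, b: a | b, frozenset()),
    "All": (lambda a, b: a and b, True),
    "Some": (lambda a, b: a or b, False),
}

def loop(kind, s, body):
    """Loop (x in S) body(x), folded with the operation of the given kind."""
    combine, null = LOOPS[kind]
    return reduce(combine, (body(x) for x in s), null)

def idemloop_lhs(kind, s, b, e):
    """Loop (x in S) if B(x) then E      (else-branch is null)."""
    null = LOOPS[kind][1]
    return loop(kind, s, lambda x: e if b(x) else null)

def idemloop_rhs(kind, s, b, e):
    """if Some (x in S) B(x) then E      (else-branch is null)."""
    null = LOOPS[kind][1]
    return e if loop("Some", s, b) else null

# Addition is not idempotent, so the analogous law fails for sums.
def sum_lhs(s, b, e):
    return sum(e if b(x) else 0 for x in s)

def sum_rhs(s, b, e):
    return e if any(b(x) for x in s) else 0

def is_even(x):
    return x % 2 == 0
```

For instance, with `S = {2, 4}` and `E = 5`, `sum_lhs` gives 10 while `sum_rhs` gives 5, whereas the three idempotent loops agree on both sides.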

Example. We show how we obtained (FD-RETRO) from (FD) (see
section 1). First, we express UJP_View in CoDi:

BigU (p ∈ Proj) BigU (d ∈ dom Dept) BigU (s ∈ d ! Dept.DProjs)
    if eq(s, p.PName) then sng(PN : p.PName, CN : p.CustName, MN : d ! Dept.MgrName)

Translating (FD) into CoDi (as we did for (INV2)), then substituting the first
occurrence of UJP_View with its definition and using (assoc) we obtain:

All (p ∈ Proj) All (d ∈ dom Dept) All (s ∈ d ! Dept.DProjs)
All (v ∈ if eq(s, p.PName) then sng(PN : p.PName, CN : p.CustName, MN : d ! Dept.MgrName))
All (v' ∈ UJP_View) if eq(v.PN, v'.PN) then eq(v.MN, v'.MN)
= true
After some manipulations of conditionals we apply, in sequence, (monad-β),
(null) and (cond-loop), and obtain:

All (p ∈ Proj) All (d ∈ dom Dept) All (s ∈ d ! Dept.DProjs) All (v' ∈ UJP_View)
    if eq(s, p.PName) and eq(p.PName, v'.PN) then eq(d ! Dept.MgrName, v'.MN)
= true

A similar rewriting in which the other occurrence of UJP_View is substituted
with its definition yields the direct translation in CoDi of (FD-RETRO). Note
that these transformations explain tableau view “replication” [KP82].

The form in which we have written the constraints turns out to be convenient
if we want them rewritten but not if we want to rewrite with them. In order to
replace equals by equals within the scope of Loop ($\overline{x} \in \overline{S}$) bindings, we will use a
form of equations “guarded” by set membership conditions for the variables:

$x_1 \in S_1, \ldots, x_n \in S_n(x_1, \ldots, x_{n-1}) \;\vdash\; E_1(x_1, \ldots, x_n) = E_2(x_1, \ldots, x_n)$

The congruence rules for the Loop construct together with some additional rules
that can be found in [PT98] allow us to move the set membership conditions
within the scope of All and vice-versa. Thus we have the following:

Lemma 1 (Two Forms for Constraints). The following two equations are
derivable from each other in CoDi’s equational theory:

$\mathsf{All}\ (\overline{x} \in \overline{S})\ \mathsf{if}\ B(\overline{x})\ \mathsf{then}\ C(\overline{x}) = \mathsf{true}$

$\overline{x} \in \overline{S} \;\vdash\; B(\overline{x}) = B(\overline{x})\ \mathsf{and}\ C(\overline{x})$

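The Equational Chase Step. Consider the constraint (d) and the
expression G:

(d) $\overline{x} \in \overline{R}_1 \;\vdash\; B_1(\overline{x}) = B_1(\overline{x})\ \mathsf{and}\ \mathsf{Some}\ (\overline{y} \in \overline{R}_2(\overline{x}))\ B_2(\overline{x}, \overline{y})$

$G \;\stackrel{\mathrm{def}}{=}\; \mathsf{Loop}\ (\overline{x} \in \overline{R}_1)\ \mathsf{if}\ B_1(\overline{x})\ \mathsf{then}\ E(\overline{x})$

Rewriting G with (d) we obtain

$\mathsf{Loop}\ (\overline{x} \in \overline{R}_1)\ \mathsf{if}\ B_1(\overline{x})\ \mathsf{and}\ \mathsf{Some}\ (\overline{y} \in \overline{R}_2(\overline{x}))\ B_2(\overline{x}, \overline{y})\ \mathsf{then}\ E(\overline{x})$

Commutativity of conjunction and nesting of conditionals yields

$\mathsf{Loop}\ (\overline{x} \in \overline{R}_1)\ \mathsf{if}\ \mathsf{Some}\ (\overline{y} \in \overline{R}_2(\overline{x}))\ B_2(\overline{x}, \overline{y})\ \mathsf{then}\ \mathsf{if}\ B_1(\overline{x})\ \mathsf{then}\ E(\overline{x})$

Finally, applying (idemloop) from right to left gives

$G' \;\stackrel{\mathrm{def}}{=}\; \mathsf{Loop}\ (\overline{x} \in \overline{R}_1)\ \mathsf{Loop}\ (\overline{y} \in \overline{R}_2(\overline{x}))\ \mathsf{if}\ B_1(\overline{x})\ \mathsf{and}\ B_2(\overline{x}, \overline{y})\ \mathsf{then}\ E(\overline{x})$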
Therefore, the equation G = G' is derivable in CoDi’s equational theory (and
hence follows semantically) from (d). We say that G' is the result of (one-step)
chasing G with (d). G could be a query but it could also be the boolean-valued
expression A when we translate constraints as equations A = true. Chasing a
constraint means chasing the expression A.
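
The effect of one chase step can be observed on a single instance. The following is a minimal sketch, assuming Loop is BigU, single variables x and y in place of the variable tuples, and each E(x) wrapped in a singleton; all function names and the sample data are hypothetical.

```python
# Sketch: one chase step for G = BigU (x in R1) if B1(x) then sng(E(x)).

def g(r1, b1, e):
    """G: BigU (x in R1) if B1(x) then sng(E(x))."""
    return {e(x) for x in r1 if b1(x)}

def g_chased(r1, r2, b1, b2, e):
    """G': BigU (x in R1) BigU (y in R2(x)) if B1(x) and B2(x, y) then sng(E(x))."""
    return {e(x) for x in r1 for y in r2(x) if b1(x) and b2(x, y)}

def constraint_holds(r1, r2, b1, b2):
    """(d): every x in R1 with B1(x) has some y in R2(x) with B2(x, y)."""
    return all(any(b2(x, y) for y in r2(x)) for x in r1 if b1(x))

r1 = {1, 2, 3, 4}
b1 = lambda x: x > 1
b2 = lambda x, y: y == 10 * x
e = lambda x: x

r2_full = lambda x: {10, 20, 30, 40}   # (d) holds on this instance
r2_small = lambda x: {10, 20}          # (d) fails for x = 3 and x = 4
```

When (d) holds on the instance, `g` and `g_chased` coincide; when it fails, `g_chased` loses the elements for which no witness y exists, so the rewrite is sound only in the presence of the constraint.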



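Query Containment. The chase step illuminates the connection between
query containment and constraints (in the relational case a similar connection
was put in evidence in […]). Consider

$Q_i \;\stackrel{\mathrm{def}}{=}\; \mathsf{BigU}\ (\overline{x} \in \overline{R}_i)\ \mathsf{if}\ B_i(\overline{x})\ \mathsf{then}\ \mathsf{sng}(E_i(\overline{x})) \qquad (i = 1, 2)$

and observe that we can express intersection by

$Q_1 \cap Q_2 = \mathsf{BigU}\ (\overline{x} \in \overline{R}_1)\ \mathsf{BigU}\ (\overline{y} \in \overline{R}_2)\ \mathsf{if}\ B_1(\overline{x})\ \mathsf{and}\ B_2(\overline{y})\ \mathsf{and}\ \mathsf{eq}(E_1(\overline{x}), E_2(\overline{y}))\ \mathsf{then}\ \mathsf{sng}(E_1(\overline{x}))$

Now consider the constraint $\mathsf{cont}(Q_1, Q_2)$ defined as

$\overline{x} \in \overline{R}_1 \;\vdash\; B_1(\overline{x}) = B_1(\overline{x})\ \mathsf{and}\ \mathsf{Some}\ (\overline{y} \in \overline{R}_2)\ B_2(\overline{y})\ \mathsf{and}\ \mathsf{eq}(E_1(\overline{x}), E_2(\overline{y}))$

It is easy to see that one step of chasing $Q_1$ with $\mathsf{cont}(Q_1, Q_2)$ yields $Q_1 \cap Q_2$.
Since $Q_1 \subseteq Q_2$ is equivalent to $Q_1 = Q_1 \cap Q_2$, it follows that containments can be
established by deriving certain constraints. In fact, we show in sections 4 and 5
that for path-conjunctive queries containment is reducible to path-conjunctive
dependencies.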
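
The connection between containment, intersection and the constraint cont can be tried out on one instance. The following is a sketch under stated assumptions: queries have the shape BigU (x in R) if B(x) then sng(E(x)) with a single variable, the functions below are invented for illustration, and the check concerns a single instance rather than all instances.

```python
# Sketch: Q1 subset-of Q2 iff Q1 = Q1 intersect Q2, on one instance.

def query(r, b, e):
    """BigU (x in R) if B(x) then sng(E(x))."""
    return {e(x) for x in r if b(x)}

def intersect(r1, b1, e1, r2, b2, e2):
    """BigU (x in R1) BigU (y in R2)
       if B1(x) and B2(y) and eq(E1(x), E2(y)) then sng(E1(x))."""
    return {e1(x) for x in r1 for y in r2
            if b1(x) and b2(y) and e1(x) == e2(y)}

def cont_holds(r1, b1, e1, r2, b2, e2):
    """cont(Q1, Q2) on this instance: each x in R1 with B1(x) has
       some y in R2 with B2(y) and E1(x) = E2(y)."""
    return all(any(b2(y) and e1(x) == e2(y) for y in r2)
               for x in r1 if b1(x))

r = set(range(1, 7))
ident = lambda x: x
mult4 = lambda x: x % 4 == 0   # Q1 selects multiples of 4
mult2 = lambda x: x % 2 == 0   # Q2 selects multiples of 2

q1 = query(r, mult4, ident)
q2 = query(r, mult2, ident)
```

On this instance `cont_holds` is true for (Q1, Q2) and false for (Q2, Q1), in agreement with `q1 <= q2` holding and `q2 <= q1` failing.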

The Relational Case. Conjunctive (tableau) queries [AHV95] are easily
expressible in CoDi (using variables that range over tuples rather than over
individuals, much as in SQL and OQL). Embedded dependencies (tuple- and
equality-generating) are expressible just as we have expressed the constraints of
the schema in figure 1. It turns out that mirroring in CoDi one step of the
relational chase of a tableau query Q (or another embedded dependency) with
an embedded dependency (d) corresponds exactly to the equational chase step
that we have described above! Moreover, the special case of chasing with trivial
dependencies explains equationally tableau query containment and tableau
minimization. When $\mathsf{cont}(Q_1, Q_2)$ is trivial, it corresponds precisely to a
containment mapping […]. We show in section 4 (and in full detail in [PT98])
that this is the case for the more general class of path-conjunctive queries and
dependencies. Moreover, we show that CoDi is complete for proving triviality
and containment/equivalence for this class.

® Since we are using tuple variables, full relational dependencies still require existential 
quantification in CoDi. With extra care, a generalization of full dependencies is still 
possible (section EJ. 
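4 Path-Conjunctive Queries, Dependencies, and Views

A schema consists simply of some names (roots) R and their types. An instance
consists of complex values (with dictionaries) of the right type for each root
name. We will distinguish between finite and unrestricted instances, where $\{\sigma\}$
means all sets. We now define:

Paths: $P ::= x \mid R \mid P.A \mid \mathsf{dom}\ P \mid x\ !\ P \mid \mathsf{true} \mid \mathsf{false}$

Path-Conjunctions: $C ::= \mathsf{eq}(P_1, P_1')\ \mathsf{and} \cdots \mathsf{and}\ \mathsf{eq}(P_n, P_n')$

Path-Conjunctive (PC) Queries: $Q ::= \mathsf{BigU}\ (\overline{x} \in \overline{P})\ \mathsf{if}\ C(\overline{x})\ \mathsf{then}\ \mathsf{sng}(E(\overline{x}))$

where $E ::= P \mid (A_1 : P_1, \ldots, A_n : P_n)$

Path-Conjunctive View: a view expressed using a PC query.

Embedded Path-Conjunctive Dependency (EPCD):

$d ::= \mathsf{All}\ (\overline{x} \in \overline{P}_1)\ \mathsf{if}\ C_1(\overline{x})\ \mathsf{then}\ \mathsf{Some}\ (\overline{y} \in \overline{P}_2(\overline{x}))\ C_2(\overline{x}, \overline{y}) = \mathsf{true}$

or equivalently (see lemma 1):

$d ::= \overline{x} \in \overline{P}_1 \;\vdash\; C_1(\overline{x}) = C_1(\overline{x})\ \mathsf{and}\ \mathsf{Some}\ (\overline{y} \in \overline{P}_2(\overline{x}))\ C_2(\overline{x}, \overline{y})$

Equality-Generating Dependency (EGD): an EPCD of the form

$\mathsf{All}\ (\overline{x} \in \overline{P}_1)\ \mathsf{if}\ C_1(\overline{x})\ \mathsf{then}\ C_2(\overline{x}) = \mathsf{true}$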

Restrictions. All PC queries and EPCDs are subject to the following
restrictions.

1. A simple type is defined (inductively) as either a base type or a record type
in which the types of the components are simple types (in other words, it
doesn’t involve set or dictionary types). Dictionary types $\sigma \times\!\!> \tau$ are
restricted such that $\sigma$ is a simple type. The paths $P_i, P_i'$ appearing in path
conjunctions must be of simple type. The expression E appearing in PC
queries must be of simple type.

2. A finite set type is a type of the form $\{\tau\}$ where the only base type occurring
in $\tau$ is bool or () (the empty record type). We do not allow in PC queries or
EPCDs bindings of the form x ∈ P such that P is of finite set type.

3. x ! P can occur only in the scope of a binding of the form x ∈ dom P.

^ Although we don’t allow record constructors in paths, their equality is still repre- 
sentable, componentwise. 

® Finite set types cause some difficulties in our current proof method. However there
are more serious reasons to worry about them: it is shown in [BNTW95] that they
can be used to encode set difference, although the given encoding uses a language
slightly richer than that of PC queries.

® This restriction could be removed at the price of tedious reasoning about partiality, 
but we have seen no need to do it for the results and examples discussed here. 
Views. The following shows that composing EPCDs with PC views yields
EPCDs. Thus all subsequent results regarding implication/triviality of EPCDs
can be used to infer implication/triviality of EPCDs over views as well.

Proposition 1. Composing an EPCD over a schema S with a PC view from R to
S yields an expression that is provably equivalent to an EPCD over R. Moreover
composing EGDs with PC views gives expressions provably equivalent to EGDs.

The main result of this section is about containment of queries and triviality
(validity, i.e. holding in all instances) of dependencies.

Theorem 1 (Containment and Triviality). Containment of PC queries and
triviality of EPCDs are reducible to each other. Both hold in all (unrestricted)
instances iff they hold in all finite instances. Both are decidable and in NP.
CoDi’s equational axiomatization is complete for deriving them.

As for relational tableau queries, the proof of the theorem relies on a
reduction to the existence of certain kinds of homomorphisms between tableaux,
except that here we must invent a more complex notion of tableau that takes
set and dictionary nesting into consideration. In CoDi, the (PC) tableau
corresponding to the PC query BigU $(\overline{x} \in \overline{P})$ if $C(\overline{x})$ then sng$(E(\overline{x}))$
is just a piece of syntax consisting of several expressions
$T ::= \{\overline{x} \in \overline{P} \;;\; C(\overline{x})\}$. Note that
relational tableaux can also be represented this way, with the variables in
$\overline{x}$ corresponding to the rows. We define a homomorphism between two tableaux to be a
mapping between variables satisfying certain PTIME-checkable conditions. The
precise definition is given in jETHHl. A basic insight about relational tableaux is
that they can themselves be considered as instances. In the presence of nested
sets and dictionaries, some work is needed to construct from any tableau $T$ a
canonical instance $\mathit{Inst}(T)$ such that a valuation from $T_2$ into $\mathit{Inst}(T_1)$ induces
a homomorphism from $T_2$ to $T_1$. Deciding containment then comes down to
testing for the existence of a homomorphism, which is in NP. Since the problem
is already NP-hard in the relational case rwn, containment of PC queries is
in fact NP-complete. The full details are given in Eng. As for the reductions
between containment and triviality, we show that for PC queries $Q_1 \subseteq Q_2$ iff
$\mathit{cont}(Q_1, Q_2)$ is trivial, hence triviality of EPCDs is also NP-hard. Conversely,
we associate to an EPCD (d) two queries, where $\overline{A}$ are fresh labels:

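$\mathit{front}(d) \stackrel{\mathrm{def}}{=}$ BigU $(\overline{x} \in \overline{P}_1)$ if $C_1(\overline{x})$ then sng$(\overline{A} : \overline{x})$

$\mathit{back}(d) \stackrel{\mathrm{def}}{=}$ BigU $(\overline{x} \in \overline{P}_1)$ BigU $(\overline{y} \in \overline{P}_2(\overline{x}))$ if $C_1(\overline{x})$ and $C_2(\overline{x}, \overline{y})$
then sng$(\overline{A} : \overline{x})$

and we show that (d) is trivial iff $\mathit{front}(d) \subseteq \mathit{back}(d)$, hence triviality of EPCDs
is NP-complete.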
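
The homomorphism test behind this reduction can be illustrated in the purely relational case. The following is a minimal sketch, not the PC-tableau construction of the paper: it assumes flat conjunctive queries given as a head tuple plus a list of atoms, with variables only, no constants, and every head variable occurring in the body. The names `homomorphisms` and `contained_in` are hypothetical.

```python
from itertools import product

def homomorphisms(src_atoms, dst_atoms):
    """Yield every variable mapping that sends each atom of src onto an atom of dst."""
    variables = sorted({t for _, args in src_atoms for t in args})
    targets = sorted({t for _, args in dst_atoms for t in args})
    dst = set(dst_atoms)
    # Brute-force search over all candidate mappings (exponential, as expected for NP).
    for image in product(targets, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((rel, tuple(h[a] for a in args)) in dst for rel, args in src_atoms):
            yield h

def contained_in(q1, q2):
    """Q1 is contained in Q2 iff some homomorphism from Q2 to Q1 maps head to head."""
    (head1, body1), (head2, body2) = q1, q2
    return any(tuple(h[v] for v in head2) == tuple(head1)
               for h in homomorphisms(body2, body1))

# Q1(x) :- E(x,y), E(y,z)      Q2(x) :- E(x,y)
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "z"))])
q2 = (("x",), [("E", ("x", "y"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The search enumerates all mappings, which matches the NP upper bound only in spirit; checking one candidate mapping is the polynomial part.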

Intuitively, $\mathit{Inst}(T)$ is the minimal instance that contains the “structure” of $T$, and
we can express syntactical conditions on $T$ as equivalent conditions on $\mathit{Inst}(T)$.

Since $\overline{x}$ may have non-simple type variables, these queries do not obey the
restriction we imposed for PC queries. Nonetheless, their containment is decidable. This
encourages the hope that the simple type restriction can be removed for this theorem.

In the case of EGDs, we can improve on complexity and we can even remove
the restriction to simple types.

Theorem 2 (Trivial EGDs). A non-simple EGD (with equality at non-simple
types) holds in all (unrestricted) instances iff it holds in all finite instances.
Triviality of non-simple EGDs is decidable in PTIME. CoDi’s equational
axiomatization is complete for deriving all trivial non-simple EGDs.

5 Path-Conjunctive Chase 

The chase as defined in section 0 is only a particular case. In general we need
to be able to map the bound variables to other variables. Moreover, we need
a precise definition of when the chase step should not be applicable because it
would yield a query or dependency that is trivially equivalent to the one we
already had. As with relational tableau queries, it suffices to define the chase
on tableaux: chasing a PC query means chasing the corresponding PC tableau,
while chasing an EPCD (d) means chasing the PC tableau that corresponds to
$\mathit{front}(d)$. For example, just for the particular case justified in section 0, chasing
the tableau $\{\overline{x} \in \overline{P}_1 \;;\; C_1(\overline{x})\}$ with the dependency

(d) All $(\overline{x} \in \overline{P}_1)$ if $C_1(\overline{x})$ then Some $(\overline{y} \in \overline{P}_2(\overline{x}))$ $C_2(\overline{x}, \overline{y})$ = true

yields the tableau $\{\overline{x} \in \overline{P}_1, \overline{y} \in \overline{P}_2(\overline{x}) \;;\; C_1(\overline{x}) \text{ and } C_2(\overline{x}, \overline{y})\}$. The complete
definition of the chase step involves homomorphisms and due to lack of space we
give it in [PT98]. Like the particular case we have shown, the complete definition
is such that a chase step transforms a PC query (an EPCD) into one provably
equal (equivalent) in the CoDi equational axiomatization. Most importantly,
the complete definition is such that if $\mathit{Inst}(T) \not\models d$ then the chase step with $d$ is
applicable to $T$.

A chase sequence with a set of EPCDs $D$ is a sequence of tableaux obtained
by successive chase steps, each with some dependency $d \in D$ (the same $d$ can be used
repeatedly). We say that a sequence starting with $T$ terminates if it reaches a
tableau $T'$ that cannot be chased with any $d \in D$ (and therefore $\mathit{Inst}(T') \models D$).
Although in general $T'$ depends on the choice of terminating chase sequence,
we shall denote it by $\mathit{chase}_D(T)$ and extend the same notation to queries and
dependencies.
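
To make the notion of a terminating chase sequence concrete, here is a small sketch for the relational special case only. It assumes flat atoms and full tuple-generating dependencies (every conclusion variable occurs in the premise), and it replaces the applicability test of the paper by the simpler check "the instantiated conclusion is not yet in the tableau". The functions `matches` and `chase` are hypothetical names, not part of CoDi.

```python
from itertools import product

def matches(premise, tableau):
    """Yield every mapping of the premise variables that embeds the premise in the tableau."""
    variables = sorted({v for _, args in premise for v in args})
    terms = sorted({t for _, args in tableau for t in args})
    for image in product(terms, repeat=len(variables)):
        h = dict(zip(variables, image))
        if all((r, tuple(h[a] for a in args)) in tableau for r, args in premise):
            yield h

def chase(tableau, dependencies):
    """Apply chase steps until no dependency adds a new atom."""
    tableau = set(tableau)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in dependencies:
            # Materialise the matches first, since the tableau is modified below.
            for h in list(matches(premise, tableau)):
                new = {(r, tuple(h[a] for a in args)) for r, args in conclusion}
                if not new <= tableau:      # the step is applicable
                    tableau |= new
                    changed = True
    return tableau

# Transitivity as a full dependency: E(x,y), E(y,z) -> E(x,z)
transitivity = ([("E", ("x", "y")), ("E", ("y", "z"))], [("E", ("x", "z"))])
start = {("E", ("a", "b")), ("E", ("b", "c")), ("E", ("c", "d"))}
result = chase(start, [transitivity])
```

Because full dependencies introduce no new variables, only finitely many atoms can ever be added, so the loop stops; this mirrors, in a much simpler setting, why termination is tied to fullness.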

Theorem 3 (Containment/Implication by Chasing). Let $D$ be a set of
EPCDs.

1. Let $Q_1, Q_2$ be PC queries such that some chasing sequence of $Q_1$ with $D$
terminates (with $\mathit{chase}_D(Q_1)$). The following are equivalent and the
unrestricted and finite version of each of (a)-(d) are equivalent as well:

(a) $Q_1 \subseteq_D Q_2$
(b) $\mathit{chase}_D(Q_1) \subseteq Q_2$
(c) $\mathit{chase}_D(\mathit{cont}(Q_1, Q_2))$ is trivial
(d) $D \models \mathit{cont}(Q_1, Q_2)$
(e) $\mathit{cont}(Q_1, Q_2)$ (hence $Q_1 \subseteq Q_2$) is provable from $D$ in CoDi

2. Let $d$ be an EPCD such that some chasing sequence of $d$ with $D$ terminates.
The following are equivalent and the unrestricted and finite version of each
of (a)-(d) are equivalent as well:

(a) $D \models d$
(b) $\mathit{chase}_D(d)$ is trivial
(c) $\mathit{chase}_D(\mathit{front}(d)) \subseteq \mathit{back}(d)$
(d) $\mathit{front}(d) \subseteq_D \mathit{back}(d)$
(e) $d$ is provable from $D$ in CoDi

Full EPCDs. This is a class of dependencies that generalizes the relational
full dependencies [AHV95] (originally called total tgd’s and egd’s in [BV84]).
Since we work with “tuple” variables, the definition needs a lot more care than
in the first-order case.

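Definition 1 (Full Dependencies). An EPCD

All $(\overline{r} \in \overline{R})$ if $B_1(\overline{r})$ then Some $(\overline{s} \in \overline{S}(\overline{r}))$ $B_2(\overline{r}, \overline{s})$ = true

is full if for any variable $s_i$ in $\overline{s}$ there exists a path $P_i(\overline{r})$ such that the following
EGD is trivial:

All $(\overline{r} \in \overline{R})$ All $(\overline{s} \in \overline{S}(\overline{r}))$ if $B_1(\overline{r})$ and $B_2(\overline{r}, \overline{s})$ then eq$(s_i, P_i(\overline{r}))$ = true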



Theorem 4 (Full Termination). If D is a set of full EPCDs and T is a PC 
tableau then any chase sequence of T with D terminates. 



Corollary 1. PC query containment under full EPCDs and logical implication 
of EPCDs from full EPCDs are reducible to each other, their unrestricted and 
finite versions coincide, and both are decidable. 



Proposition 2. Let D be a set of full EPCDs and (d) be another EPCD. Let T 
be the tableau part of (d) that gets chased. Then deciding whether D \= d can be 
done in time 

Cl \D\ (n/r)^2^* 

where $c_1$ and $c_2$ are two constants, $|D|$ is the number of EPCDs in $D$, $n$ is the
number of variables in $T$, $s$ is the maximum number of variables of an EPCD
in $D$, while $w$ and $h$ are two schema parameters: $w$ is the maximum number of
attributes that a record type has in the schema (including nested attributes at
any depth) and $h$ is the maximum height of a type in the schema.

Therefore, the complexity of the chase decision procedure for full EPCDs is
not worse than in the relational subcase mm (recall also the lower bound
of [CLM81]). For EGDs, which are always full, it is easy to see that the problem
is actually in PTIME, as in the relational case.

The chase with full EPCDs also enjoys the following nice properties:

Theorem 5 (Confluence). Consider two terminal chase sequences of a PC
tableau $T$ with a set $D$ of full EPCDs, ending in $T_1$ and $T_2$ respectively. Then
$\mathit{Inst}(T_1)$ and $\mathit{Inst}(T_2)$ must be isomorphic.

Note that we cannot hope that $T_1$ and $T_2$ are “equal” (even modulo variable
renaming) because the path-conjunctions may be different, although logically
equivalent.

Proposition 3 (Semantic Invariance). Let $D_1$ and $D_2$ be two semantically
equivalent (hold in the same instances) sets of full EPCDs and let $T_1$ and $T_2$
be the resulting tableaux of two arbitrary terminal chase sequences of a given
tableau $T$ with $D_1$ and, respectively, $D_2$. Then $\mathit{Inst}(T_1)$ and $\mathit{Inst}(T_2)$ must be
isomorphic.



6 Other Results 

The companion technical report contains several other results that space
restrictions prevent us from outlining here. In particular, we show there
how to extend and generalize the complete proof procedure result of Beeri and
Vardi mvM for the case when the chase may not terminate. The result also
applies to query containment. Note that since we are not in the first-order case,
even the r.e.-ness is non-obvious. The equational axiomatization of CoDi is
complete in this case too. Moreover, we show in ip™ that the containment parts
of Theorems 1 and 3 (as well as the result for the non-terminating chase) also
hold, with similar proofs, for boolean-valued Some-queries in PC form, where
containment means boolean implication.

We have shown how to use dictionaries to model OO classes but in fact they
are much more versatile. In [PT98] we give examples of queries, views, and
constraints that use dictionaries to model indexes. We also use the chase to
characterize nesting of relations into nested sets or dictionaries, as well as unnesting
of the corresponding structures.



7 Related Work and Further Investigations 

Related work. The monad algebra approach to aggregates pi i I is related to 
the monoid comprehensions of but it is somewhat more general since 

there exist monads (trees for example) whose monad algebras are not monoids. 
The maps of |AI;PKtil^ . the treatment of object types in [HKh3IJ and in [I )H Ph7j . 
that of views in jdSDA94| . and that of arrays in piWl W9Kj are related to our use 
of dictionaries. An important difference is made by the operations on dictionaries 
used here. 

The idea of representing constraints as equivalences between boolean-valued
(OQL actually) queries already appears in [FRV96]. The equational theory of
CoDi proves almost the entire variety of proposed algebraic query equivalences
beginning with the standard relational algebraic ones, and including |S/89ap .
|S/89hl( d )H2I( iliiH1IH’MH5hlF’IVUi5'^ and the very comprehensive work by Beeri
and Kornatzky [BK93]. Moreover, using especially (commute), CoDi validates
and generalizes standard join reordering techniques, thus the problem of join
associativity in object algebras raised in rrm does not arise. Our PC queries are
less general than COQL queries csnzi, by not allowing alternations of
conditionals and BigU. However, we are more general in other ways, by incorporating
dictionaries and considering constraints. Containment of PC queries is in NP while
a double exponential upper bound is provided for containment of COQL queries.
In [Bid87] it is shown that containment of conjunctive queries for the Verso
complex value model and algebra is reducible to the relational case. Other studies
include semantic query optimization for unions of conjunctive queries [CGM88],
containment under Datalog-expressible constraints and views insnni, and
containment of non-recursive Datalog queries with regular expression atoms under
a rich class of constraints [( Xtl;t)8j . We are not aware of any extension of the
chase to complex values and oodb models. Hara and Davidson [HD98] provide
a complete intrinsic axiomatization of generalized functional dependencies for
complex value schemas without empty sets. Fan and Weinstein [FW98] examine
the un/decidability of logical implication for path constraints in various classes
of oo-typed semistructured models.

Further investigations. We conjecture that the simple type restriction
can be removed without affecting the containment/triviality result (Theorem 1).
When equality at non-simple types is allowed, the chase is incomplete. However,
the chase seems to be able to prove the two containments that make a set or 
dictionary equality. This suggests that a complete proof procedure might exist 
that combines the chase with an extensionality rule. Another important direction 
of work is allowing alternations of conditionals and BigU and trying to extend 
the result of psnu from weak equivalence to equivalence. The axiomatization 
of inclusions in can be soundly translated into CoDi’s equational theory. 

We conjecture that CoDi is a conservative extension of this axiomatization. 
An interesting observation is that the equational chase does not require the 
PC restrictions, but just a certain form of query and dependency. It is natural 
to ask how far we can extend the completeness of this method. Most EPCDs
in our examples are full. Some of those that are not may be amenable to the
ideas developed for special cases with inclusion dependencies [JK84, CKV90].
Another question regards the decidable properties of classes of first-order queries
and sentences that might correspond (by encoding, e.g. lEsnzi) to PC queries
and EPCDs. Other encodings might allow us to draw comparisons with the
interesting results of EHEM!.

It is an intriguing question why the powerful relational optimization tech- 
niques using tableaux and dependencies have not made their way into commer- 
cial optimizers. There seem to be two reasons for this. One is that queries crafted 
by users tend not to introduce spurious joins and tend to take advantage of cer- 
tain constraints (even implicit ones!). The other reason is that the techniques 
have generally exponential algorithms. But do these reasons carry through to the 
paradigm of distributed mediator-based systems that interests us? We have
already argued that in such systems there are a lot of (secondary) queries that are
generated by not-so-smart software components. The complexity issue remains a 
serious one and only an experimental approach might put it to rest. Recall that 
the algorithms are exponential in the size of queries and dependencies, not of 
data. Note also that standard relational optimization using dynamic program- 
ming is also exponential in theory, yet practical. Finally, some work on PTIME 
subcases exists ICb.HTIbarbll and might be extended. On a more practical note, 
rewriting with individual CoDi axioms generates too large a search space to be 
directly useful in practical optimization. An important future direction is the 
modular development of coarser derived CoDi transformations corresponding to 
various optimization techniques in a rule-based approach. 

Anecdote We were happily proving equalities in CoDi by rewriting with de- 
pendencies and (idemloop) for quite some time before we realized the connection 
with the chase! 

Many thanks to Serge Abiteboul, Peter Buneman, Sophie Cluet, Susan
Davidson, Alin Deutsch, Wenfei Fan, Carmem Hara, Rona Machlin, Dan Suciu,
Scott Weinstein, and the paper’s reviewers.

Abstract. We study the query language BQL: the extension of the
relational algebra with for-loops. We also study FO(FOR): the extension
of first-order logic with a for-loop variant of the partial fixpoint operator. 
In contrast to the known situation with query languages which include 
while-loops instead of for-loops, BQL and FO(FOR) are not equivalent. 
Among the topics we investigate are: the precise relationship between 
BQL and FO(FOR); inflationary versus non-inflationary iteration; the 
relationship with logics that have the ability to count; and nested versus 
unnested loops. 



1 Introduction 



Much attention in database theory (or finite model theory) has been devoted to
extensions of first-order logic as a query language eisv^iaiagH. A seminal
paper in this context was that by Chandra in 1981, where he added various
programming constructs to the relational algebra and compared the expressive
power of the various extensions thus obtained. One such extension is the
language that we denote here by BQL: a programming-language-like query language
obtained from the relational algebra by adding assignment statements,
composition, and for-loops. Assignment statements assign the result of a relational
algebra expression to a relation variable; composition is obvious; and for-loops
allow a subprogram to be iterated exactly as many times as the cardinality of
the relation stored in some variable.

For-loops of this kind have since received practically no attention in the 
literature. In contrast, two other iteration constructs, namely least or inflationary 
fixpoints, and while-loops, have been studied extensively. In the present paper 
we will make some steps toward the goal of understanding for-loops in query 
languages as well as fixpoints and while-loops.

The variant of BQL with while-loops instead of for-loops, called RQL, was
introduced by Chandra and Harel EHHa. In the same paper these authors also 
introduced, in the context of query languages, the extension of first-order logic 
with the least fixpoint operator; we will denote this logic here by FO(LFP). By 
using a partial fixpoint operator we obtain a logic, called FO(PFP), with the 
same expressive power as RQL. 

Here, we will introduce the FOR operator, which iterates a formula (called the 
“body formula” ) precisely as many times as the cardinality of the relation defined 
by another formula (called the “head formula”). In contrast to the equivalence 
of RQL and FO(PFP), FO(FOR) is not equivalent to, but strictly stronger than, 
BQL. The reason for this turns out to be the presence of free variables acting 
as parameters; the restriction of FO(FOR) that disallows such parameters is 
equivalent to BQL. 

The question whether FO(LFP) is strictly weaker than FO(PFP) is a famous
open problem, since Abiteboul and Vianu showed that it is equivalent
to whether PTIME is strictly contained in PSPACE [AV95]. In FO(LFP) we
can equivalently replace the least fixpoint operator by the inflationary fixpoint 
operator IFP. So the PTIME versus PSPACE question is one of inflationary 
versus non-inflationary iteration. Since the FOR operator is non-inflationary in 
nature, one may wonder about the expressive power of the inflationary version 
of FOR, which we call IFOR. We will show that FO(IFOR) lies strictly between 
FO(IFP) and FO(FOR). Since in FO(FOR) we can define parity, FO(FOR) is 
not subsumed by FO(PFP), and conversely FO(PFP) can only be subsumed 
by FO(FOR) if PSPACE equals PTIME, since FO(PFP) equals PSPACE on 
ordered structures and FO(FOR) is contained in PTIME. 

A natural question is how FO(FOR) relates to FO(IFP, #), the extension of
FO(IFP) with counting. Actually, FO(FOR) is readily seen to be subsumed by
FO(IFP, #). We will show that this subsumption is strict, by showing that one
cannot express in FO(FOR) that two sets have the same cardinality. We also
will show that the restriction of FO(IFP, #) that allows modular counting only
is strictly subsumed by FO(FOR).

The main technical question we will focus on in this paper is that of nesting 
of for-loops. It is known that nested applications of while-loops in RQL, or of the 
PFP operator in FO(PFP), do not give extra expressive power; a single while- 
loop or PFP operator suffices |EF^ . In the case of BQL, however, we will show 
that nesting does matter, albeit only in a limited way: one level of nesting already 
suffices. In the case of FO(FOR), there are two kinds of nesting of the FOR 
operator: in body formulas, and in head formulas. Regarding bodies, we will show 
that nested applications of the FOR operator in body formulas again do matter, 
although here we do not know whether nesting up to a certain level is sufficient. 
Regarding heads, we will show that the restriction of FO(FOR) that allows
only head formulas that are “pure” first-order is weaker than full FO(FOR).
By “pure” we mean that the formula cannot mention relation variables from
surrounding FOR operators; from the moment this is allowed, we are back to
full FO(FOR).

^ The analogous result for BQL (which is weaker than FO(FOR)) was stated by
Chandra in the early eighties |( hia.SIK ;ha,SM| , but no proof has been published.

In this extended abstract, the proofs of our results are only sketched. 

2 Preliminaries 

Throughout the paper we will use the terminology and notation of mathematical
logic [EFT94]. A relational vocabulary $\tau$ is what in the field of databases is known
as a relational schema; a structure over $\tau$ is what is known as an instance of that
schema with an explicit domain. (Structures are always assumed to be finite in
this paper.) The query language of first-order logic (the relational calculus) is
denoted by FO.

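Let us briefly recall the syntax and semantics of FO(PFP) and FO(IFP). Let
$\varphi(\overline{x}, \overline{y}, X, \overline{Y})$ be an FO formula over $\tau \cup \{X, \overline{Y}\}$, where $X$ is an $n$-ary relation
variable, $\overline{x}$ is of length $n$, and $\overline{Y}$ is a tuple of relation variables. On any $\tau$-structure
$\mathcal{A}$ expanded with interpretations for the first- and second-order parameters $\overline{y}$ and
$\overline{Y}$, $\varphi$ defines the stages $\varphi^0(\mathcal{A}) := \emptyset$ and
$\varphi^i(\mathcal{A}) := \{\overline{a} \mid \mathcal{A} \models \varphi[\overline{a}, \varphi^{i-1}(\mathcal{A})]\}$ for
each $i > 0$. If there exists an $i_0$ such that $\varphi^{i_0}(\mathcal{A}) = \varphi^{i_0+1}(\mathcal{A})$, then we say that
the partial fixpoint of $\varphi$ on $\mathcal{A}$ exists, and define it to be $\varphi^{i_0}(\mathcal{A})$; otherwise we
define it as the empty set. We obtain FO(PFP) by augmenting FO with the
rule $[\mathrm{PFP}_{\overline{x},X}\,\varphi](\overline{t})$, which expresses that $\overline{t}$ belongs to the partial fixpoint of $\varphi$.
For FO(IFP) we consider the stages $\tilde{\varphi}^0(\mathcal{A}) := \emptyset$ and
$\tilde{\varphi}^i(\mathcal{A}) := \tilde{\varphi}^{i-1}(\mathcal{A}) \cup \{\overline{a} \mid \mathcal{A} \models \varphi[\overline{a}, \tilde{\varphi}^{i-1}(\mathcal{A})]\}$, for each $i > 0$.
Here, there always exists an $i_0$ such that
$\tilde{\varphi}^{i_0}(\mathcal{A}) = \tilde{\varphi}^{i_0+1}(\mathcal{A})$. We call $\tilde{\varphi}^{i_0}(\mathcal{A})$ the inflationary fixpoint of $\varphi$ on $\mathcal{A}$. We
obtain FO(IFP) by augmenting FO with the rule $[\mathrm{IFP}_{\overline{x},X}\,\varphi](\overline{t})$, which expresses
that $\overline{t}$ belongs to the inflationary fixpoint of $\varphi$.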
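
The two stage constructions can be compared side by side in a short sketch. It assumes that the formula is given as a Python function `step` from a finite relation to a finite relation over a fixed finite universe; the names `pfp` and `ifp` are illustrative only.

```python
def pfp(step):
    """Partial fixpoint: iterate from the empty set; return the fixpoint if the
    stages stabilise, and the empty set if they start to cycle instead."""
    seen = set()
    current = frozenset()
    while current not in seen:
        seen.add(current)
        nxt = frozenset(step(current))
        if nxt == current:
            return current
        current = nxt
    return frozenset()

def ifp(step):
    """Inflationary fixpoint: each stage also keeps everything from the previous stage."""
    current = frozenset()
    while True:
        nxt = current | frozenset(step(current))
        if nxt == current:
            return current
        current = nxt

E = {(1, 2), (2, 3)}
universe = frozenset({1, 2, 3})

def tc_step(X):
    # E(x,y) or there is z with X(x,z) and E(z,y)
    return E | {(x, y) for (x, z) in X for (z2, y) in E if z == z2}

def flip(X):
    # complement of a unary relation: the stages alternate and never stabilise
    return universe - X
```

With these definitions `pfp(tc_step)` and `ifp(tc_step)` agree, whereas `pfp(flip)` is empty (the stages alternate) and `ifp(flip)` is the whole universe.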

3 Query Languages with For-Loops 

Let $\tau$ be a vocabulary. The set of BQL programs over $\tau$ is inductively defined as
follows:

(i) if $X$ is a relation variable of arity $n$ and $e$ is a relational algebra expression
of arity $n$ over the vocabulary $\tau$ and the relation variables, then $X := e$ is a
BQL program;

(ii) if $P_1$ and $P_2$ are BQL programs then $P_1; P_2$ is a BQL program; and

(iii) if $P$ is a BQL program then for $|X|$ do $P$ od is a BQL program.

The semantics of BQL programs of the form (i) or (ii) is defined in the obvious 
way; for BQL programs of the form (iii) the subprogram P is iterated as many 
times as the cardinality of the relation stored in variable X prior to entering the 
loop. 

Example 1. Let $\tau_0 = \{E\}$ be the vocabulary of graphs; so $E$ is the binary edge
relation. Consider the following BQL programs:

$X := \{x \mid x = x\}$; $Y := \emptyset$; for $|X|$ do $Y := \neg Y$ od.

$X := E$; for $|X|$ do $X := \{(x, y) \mid X(x, y) \lor (\exists z)(X(x, z) \land E(z, y))\}$ od.

The first program computes, in variable $Y$, the parity of the number of vertices
of the graph, and the second program computes the transitive closure of $E$.
Formally we should have used relational algebra expressions in the assignment
statements, but we use relational calculus formulas instead. □
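
The two programs of Example 1 can be mimicked directly in Python. This is only an illustrative sketch: relations are modelled as Python sets, the nullary relation $Y$ as a boolean, and the function names are made up. The point it shows is that the loop bound is read once, before the loop is entered.

```python
def parity_of_vertices(vertices):
    X = set(vertices)          # X := {x | x = x}
    Y = False                  # Y := empty nullary relation, i.e. false
    for _ in range(len(X)):    # for |X| do ... od, bound fixed on entry
        Y = not Y              # Y := not Y
    return Y                   # True iff the number of vertices is odd

def transitive_closure(E):
    X = set(E)                 # X := E
    for _ in range(len(X)):    # |X| is evaluated before X starts to grow
        X = X | {(x, y) for (x, z) in X for (z2, y) in E if z == z2}
    return X
```

Iterating $|E|$ times is enough here because a shortest path between two vertices never uses more than $|E|$ edges.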

By a standard simulation technique, one can simulate every FO(IFP)
formula by a BQL program. It is well known that it is not expressible in FO(PFP)
whether the cardinality of a set is even. Hence, FO(PFP) does not subsume BQL
and BQL strictly subsumes FO(IFP).

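We next introduce the logic FO(FOR). The crucial construct in the formation
of FO(FOR) formulas is the following: suppose $\varphi(\overline{x}, \overline{y}, X, \overline{Y})$ and $\psi(\overline{z}, \overline{u}, \overline{y}, \overline{Y})$ are
formulas and $\overline{x}$, $\overline{u}$ and $X$ are of the same arity. Then the following FO(FOR)
formula $\xi$ is obtained from $\psi$ and $\varphi$ through the FOR-constructor:

$\xi(\overline{u}, \overline{y}, \overline{Y}) = [\mathrm{FOR}^{\#\overline{z}\,\psi}_{\overline{x},X}\,\varphi](\overline{u})$.

The formula $\psi$ is called the head formula, and $\varphi$ is called the body formula of $\xi$.
For each $\tau$-structure $\mathcal{A}$ expanded with interpretations for the parameters $\overline{y}$ and
$\overline{Y}$, and for any tuple of elements $\overline{a}$: $\mathcal{A} \models \xi[\overline{a}]$ iff $\overline{a} \in \varphi^m(\mathcal{A})$, where $m$ equals the
cardinality of the set $\{\overline{c} \mid \mathcal{A} \models \psi[\overline{c}, \overline{a}]\}$. Here $\varphi^m$ is as defined in Section 2.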
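
The semantics of the FOR-constructor amounts to "count the head, then iterate the body that many times from the empty relation". The sketch below assumes the head count has already been computed and that the body is a Python function on finite relations; `for_operator` and `even_outdegree` are hypothetical names, and the even-outdegree encoding is one possible choice that is not claimed to coincide with the formula used in the paper.

```python
def for_operator(head_count, body):
    """Return the stage reached after head_count applications of body to the empty relation."""
    stage = frozenset()
    for _ in range(head_count):
        stage = frozenset(body(stage))
    return stage

TRUE = frozenset({()})    # nullary relation containing the empty tuple
FALSE = frozenset()       # empty nullary relation

def even_outdegree(E, v):
    # Head: count the successors of v (v acts as an individual parameter of the head).
    m = len({y for (x, y) in E if x == v})
    # Body: negate a nullary relation; after m steps it is TRUE iff m is odd.
    odd = for_operator(m, lambda X: FALSE if X else TRUE)
    return odd == FALSE
```

The vertex `v` enters only through the head count, which is the role that the free individual variable plays as a parameter of a head formula.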

Example 2. Consider the following FO(FOR) formulas over $\tau_0$:

$\xi_1(u, v) = [\mathrm{FOR}^{\#x,y\,E(x,y)}_{x,y,X}\; E(x, y) \lor (\exists z)(X(x, z) \land E(z, y))](u, v)$

and 

6(a:) = 

The formula $\xi_1$ defines the transitive closure of $E$, and $\xi_2$ expresses that vertex
$x$ has even outdegree. □

Note that all queries definable in FO(FOR) are in PTIME. We will show
that there are PTIME queries that are not definable in FO(FOR). However,
for every PTIME query $Q$ on graphs there is a formula $\varphi \in$ FO(FOR) such
that $Q(G) \neq \varphi(G)$ for a vanishingly small fraction of $n$-element graphs $G$.
Indeed, Hella, Kolaitis and Luosto showed that a canonical ordering is definable
on almost all graphs in FO(IFP) plus the even quantifier [HKL96]. For future
reference we call the query that expresses this ordering the HKL query. Clearly, it
is definable in FO(FOR), too. Since FO(FOR) can also easily simulate FO(IFP),
which is known to capture PTIME in the presence of ordering, FO(FOR) thus
captures PTIME on almost all graphs.

3.1 BQL versus FO(FOR) 

Every BQL query is also definable in FO(FOR). The converse, however, does
not hold. For non-Boolean queries this is readily seen. The relation variables in
a BQL program always hold relations that are closed under indistinguishability
in first-order logic with a fixed number of variables. To see this, let $P$ be a BQL
program and let $k$ be the maximum number of variables needed to express the
assignment expressions of $P$ in FO. Then, for any structure $\mathcal{A}$, the value of any
relation variable of $P$ is definable by an FO$^{2k}$-formula. (FO$^{2k}$ denotes the
$2k$-variable fragment of FO.) The query defined by $\xi_2$ in Example 2, however, is not
closed under FO$^{2k}$-indistinguishability for any $k$. Indeed, let $G_k$ be the graph
depicted in Figure 1. No FO$^{2k}$ formula can distinguish the node $p$ from node $p'$,
but they are clearly distinguished by $\xi_2$.



Fig. 1. The graph $G_k$



To separate BQL from FO(FOR) with a Boolean query, we need to do more
work. Let $Q_1$ be the query “Is there a node with even outdegree?” This query is
definable in FO(FOR) by the sentence $(\exists x)\xi_2(x)$. However, this query is not
expressible in BQL. Indeed, let $S$ be the class of graphs that are disjoint unions of
stars with the same number of children. Hence, each graph $G_{p,c} \in S$ is
characterized by the number of stars $p$ and the number of children $c$. On $S$, there are, up
to equivalence, only a finite number of FO$^{2k}$ formulas. Hence, there are only a
finite number of possible values for the relation variables of BQL programs. On
large enough graphs the values of the relation variables in the bodies of for-loops
will thus start to cycle. Moreover, this behavior can be described by modular
equations:

Lemma 3. For any natural numbers $k$ and $v$, there exist polynomials $q_1, \ldots, q_N$
in $p$ and $c$ and natural numbers $d_1, d_2$ such that for all $p, p', c, c' > d_1$: if

$q_j(p, c) \equiv q_j(p', c') \pmod{d_2}$ for $j = 1, \ldots, N$,

then for every BQL program $P$ with at most $k$ individual and $v$ relational
variables, $P$ accepts $G_{p,c}$ iff $P$ accepts $G_{p',c'}$. Moreover, every polynomial is of the
form $p \cdot q(p, c)$.

Now, suppose $P$ is a BQL program with $k$ individual and $v$ relational
variables that computes $Q_1$. Take $p$ as a multiple of $d_2$ and take $c$ arbitrary but
large enough. Then all above described polynomials are equivalent to 0
modulo $d_2$. Hence, $P$ accepts $G_{p,2c}$ iff $P$ accepts $G_{p,2c+1}$. This leads to the desired
contradiction. We thus obtain:

Theorem 4. BQL is strictly weaker than FO(FOR).

It is natural to ask whether there is a fragment of FO(FOR) which is
equivalent to BQL. The FO(FOR) formula $\xi_2$ of Example 2 uses an individual
parameter $x$ in its head formula. Hence, to find a fragment of FO(FOR) equivalent to
BQL, one could try to exclude individual parameters from head formulas. This,
however, is not enough, since they can be simulated by relational parameters in
the head and individual parameters in the body (details omitted).

Proposition 5. The fragment of FO(FOR) that allows individual parameters
neither in heads nor in bodies of for-loops is equivalent to BQL.

3.2 Inflationary versus Non-inflationary Iteration

If we replace, in the definition of the semantics of the FOR operator, the stages
$\varphi^i$ by $\tilde{\varphi}^i$ (cf. Section 2), we obtain the inflationary version of FOR, which we
denote by IFOR. It is routine to verify that FO(IFOR) collapses to FO on sets.
Hence, one cannot express in FO(IFOR) that the cardinality of a set is even.
This implies that FO(IFOR) is strictly weaker than FO(FOR). On the other
hand, FO(IFOR) is strictly more expressive than FO(IFP). Indeed, consider the
vocabulary $\tau = \{U, R\}$ with $U$ unary and $R$ binary, and let $Q_2$ be the following
query: Q 2 {A) is true iff is a chain and \U^\ > and is false otherwise 

(here $|U^{\mathcal{A}}|$ denotes the cardinality of the set $U^{\mathcal{A}}$). Using pebble games [EF95],
one can show that $Q_2$ is not definable in FO(IFP). However, it can be defined
in FO(IFOR) by the formula

$\mathit{chain} \land (\forall z)\big((\exists y)R(y, z) \to [\mathrm{IFOR}^{\#x\,U(x)}_{x,X}\; \mathit{first}(x) \lor (\exists y)(X(y) \land R(y, x))](z)\big)$,

where $\mathit{chain}$ is an FO(FOR) sentence saying that $R$ is a chain and $\mathit{first}(x)$
defines the first element of the chain. The above implies:

Proposition 6. FO(IFOR) lies strictly between FO(IFP) and FO(FOR). 

3.3 A Comparison with Logics That Count 

Inflationary fixpoint logic with counting [GO93, Ott96], denoted by FO(IFP, #),
is a two-sorted logic. With any structure $\mathcal{A}$ with universe $A$, we associate the
two-sorted structure $\mathcal{A}^* := \mathcal{A} \cup (\{0, \ldots, n\}; <)$ with $n = |A|$ and where $<$ is the
canonical ordering on $\{0, \ldots, n\}$. The two sorts are related by counting terms:
let $\varphi(x, \overline{y})$ be a formula, then $\#_x[\varphi]$ is a term of the second sort. For any
interpretation $\overline{b}$ for $\overline{y}$, its value equals the number of elements $a$ that satisfy $\varphi(a, \overline{b})$.
The IFP operator can be applied to relations of mixed sort.

Every FO(FOR) formula can be readily simulated in FO(IFP, #), and this
subsumption is strict.

Theorem 7. FO(FOR) is strictly weaker than FO(IFP, #).

Proof (sketch). Consider the vocabulary $\tau = \{U, V\}$ with $U$ and $V$ unary. The
equicardinality query evaluates to true on $\mathcal{A}$ if $|U^{\mathcal{A}}| = |V^{\mathcal{A}}|$. It is easily definable
in FO(IFP, #) as $\#_x[U(x)] = \#_x[V(x)]$. This query, however, is not definable in
FO(FOR). The proof proceeds along the lines of the proof of Theorem 4.
Structures over $\tau$ have only a constant number of automorphism types. This implies
that there are only a constant number of FO(FOR)-definable relations (every
FO(FOR)-definable relation is closed under all automorphisms of the structure
at hand). Hence, for large enough structures, the values of the relations defined
by body formulas of for-loops start to cycle. This behavior can be described by
modular equations similar to those of Lemma 3. More precisely, for any FO(FOR)
sentence $\xi$ there exist polynomials $q_1, \ldots, q_N$ in $n$ and $m$ and natural numbers
$d_1$ and $d_2$ such that for all $n, n', m, m' > d_1$: if

$q_j(n, m) \equiv q_j(n', m') \pmod{d_2}$ for $j = 1, \ldots, N$,

then $\mathcal{A}_{n,m} \models \xi$ iff $\mathcal{A}_{n',m'} \models \xi$. Here, $\mathcal{A}_{n,m}$ is the $\tau$-structure where $|U^{\mathcal{A}}| = n$,
$|V^{\mathcal{A}}| = m$, $U^{\mathcal{A}} \cap V^{\mathcal{A}} = \emptyset$, and where the domain of $\mathcal{A}$ equals $U^{\mathcal{A}} \cup V^{\mathcal{A}}$. Then
$\mathcal{A}_{n,n} \models \xi$ iff $\mathcal{A}_{n,m} \models \xi$ for $m = n + d_2$ and $n > d_1$. Hence, no FO(FOR) sentence
can define the equicardinality query. □
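
The last step of the sketch uses that the hypothesis of the modular condition is automatically met for $m = n + d_2$. Under the assumption (not stated explicitly in the sketch) that each $q_j$ has integer coefficients $c_{a,b}$, this can be checked term by term:

```latex
q_j(n,\, n + d_2)
  \;=\; \sum_{a,b} c_{a,b}\, n^{a} (n + d_2)^{b}
  \;\equiv\; \sum_{a,b} c_{a,b}\, n^{a} n^{b}
  \;=\; q_j(n,\, n) \pmod{d_2},
\qquad \text{since } (n + d_2)^{b} \equiv n^{b} \pmod{d_2}.
```

So for $n > d_1$ the structures $\mathcal{A}_{n,n}$ and $\mathcal{A}_{n,n+d_2}$ satisfy the same sentence $\xi$, although only the first one has $U$ and $V$ of equal cardinality.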

We conclude this section with three remarks.

(i) It can be shown that on monadic structures FO(FOR) collapses to FO plus
modulo counting.

(ii) Though FO(IFP, #) is strictly more expressive than FO(FOR), the logic
FO(IFP) extended with modulo counting is strictly weaker than FO(FOR).
Indeed, the query $Q_2$ from Section 3.2 is not expressible in FO(IFP) plus
modulo counting. We omit the details.

(iii) Note that FO(IFP, #) formulas are not evaluated on the $\tau$-structures
themselves but on the extension of these $\tau$-structures with a fragment of the
natural numbers. Hence, to be able to compare FO(FOR) with FO(IFP, #) in an
honest way, we should make this second sort also available to FO(FOR)
formulas. It now turns out that when FO(FOR) formulas are evaluated
on these number-extended structures, FO(FOR) becomes in fact equal to
FO(IFP, #).

4 Nesting of For-Loops 
4.1 Nesting in BQL 

The nesting depth of a BQL program $P$, denoted by $\mathit{depth}(P)$, is inductively
defined as follows:

(i) the depth of an assignment statement is 0;

(ii) $\mathit{depth}(P_1; P_2) := \max\{\mathit{depth}(P_1), \mathit{depth}(P_2)\}$; and

(iii) $\mathit{depth}(\text{for } |X| \text{ do } P \text{ od}) := \mathit{depth}(P) + 1$.

For $i \geq 0$, let BQL$_i$ be the fragment of BQL consisting of BQL programs of
nesting depth smaller than or equal to $i$. We also refer to BQL$_1$ as unnested BQL.
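
The inductive definition translates directly into a recursive function over a program syntax tree. The three node classes below are a hypothetical encoding of clauses (i) to (iii), introduced only for this sketch; expressions are kept as opaque strings.

```python
from dataclasses import dataclass

@dataclass
class Assign:          # X := e
    var: str
    expr: str

@dataclass
class Seq:             # P1; P2
    first: object
    second: object

@dataclass
class For:             # for |X| do P od
    var: str
    body: object

def depth(program):
    """Nesting depth, following clauses (i), (ii) and (iii)."""
    if isinstance(program, Assign):
        return 0
    if isinstance(program, Seq):
        return max(depth(program.first), depth(program.second))
    if isinstance(program, For):
        return depth(program.body) + 1
    raise TypeError("not a BQL program node")

# The transitive closure program of Example 1: one loop, so it is unnested.
tc = Seq(Assign("X", "E"), For("X", Assign("X", "X or X;E")))
nested = For("X", Seq(Assign("Y", "X"), For("Y", Assign("Z", "Y"))))
```

Under this encoding `depth(tc)` is 1 and `depth(nested)` is 2, so the first program lies in BQL$_1$ and the second in BQL$_2$.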

We show: 
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Theorem 8 . Unnested BQL is strictly weaker than BQL. 

Proof (sketch). Take the vocabulary $\tau = \{E, C\}$, with $E$ binary and $C$ unary.
We now consider graphs of the form of a chain, where to each node of the chain
a separate set is attached. The chain is distinguished in the structure by the
unary relation $C$. The sets are of various sizes bounded above by the length of
the chain. Let $n$ be the length of the chain, and let $a_i$ be the size of the $i$-th
set. Then this structure is denoted by $a = (a_1, \ldots, a_n)$; this is an element of
$[1,n]^n$. In Figure 2, an example of such a structure is depicted. Now, let $Q_3$ be
the binary query defined as follows: $Q_3(a) := \{(i, a_i) \mid i \in \{1, \ldots, n\}\}$. By the
numbers $i$ and $a_i$ we mean respectively the $i$-th and $a_i$-th element of the chain.
Observe that this query is injective. This query is expressible in BQL. We omit
the details.
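As a small illustration of the query $Q_3$, the following Python sketch works directly on the size vector $(a_1, \ldots, a_n)$ rather than on the chain-with-attached-sets structure itself; that abstraction, and the function name `q3`, are assumptions made here for brevity.

```python
def q3(a):
    """Return {(i, a_i) | 1 <= i <= n} for a size vector a = (a_1, ..., a_n).

    Each a_i must lie in 1..n, mirroring the requirement that set sizes
    are bounded above by the length of the chain.
    """
    n = len(a)
    if any(not (1 <= ai <= n) for ai in a):
        raise ValueError("every a_i must lie between 1 and n")
    return {(i, ai) for i, ai in enumerate(a, start=1)}
```

Since the result records every position together with its value, two different vectors of the same length always yield different results, which is the injectivity used in the proof.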

Towards a contradiction, let P be an unnested BQL program that computes 
Q 3 . Let k be the maximum number of variables that are needed to express the 
assignments of P in FO. 

For any $k > 2$ and any $n$ large enough there always exist two nonisomorphic
graphs $a, \beta \in [2k,n]^n$ in which the cardinalities of the relations occurring in
the heads of all the for-loops in $P$ are equal. Indeed, the number of possible
sequences of cardinalities of heads in $P$ grows polynomially with $n$, while there
are exponentially many elements in $[2k,n]^n$.

These $a$ and $\beta$ are indistinguishable in FO$^{2k}$. Moreover, $P(a)$, the result of
$P$ on input $a$, is indistinguishable in FO$^{2k}$ from $P(\beta)$, the result of $P$ on input
$\beta$, since in both structures $P$ is evaluated as the same sequence of FO$^{k}$-definable
substitutions. $P(a)$ and $P(\beta)$ are indistinguishable subsets of chains of equal
length, so they must in fact be equal. Hence, the query expressed by $P$ is not
injective. This leads to the desired contradiction. □






Fig. 2. $a = (a_1, \ldots, a_n)$



Note that the query Q 3 used in the above proof is not a Boolean query. It 
remains open whether there exists a Boolean query that separates BQL from 
unnested BQL. 

We next show that the nesting hierarchy is not strict. 

Theorem 9. BQL is equivalent to BQL$_2$.

Proof, (sketch) By structural induction we show that every BQL program P is 
equivalent to one of the form 

$P_1$; while $Y \neq \emptyset$ do $P_2$ od,   (*)

where $P_1$ is a relational algebra program; $P_2$ is an unnested BQL program; and
the variable $Y$ becomes empty on any structure after at most a polynomial
number of iterations, so that the while-loop can be implemented by a for-loop.
The important property is that if $Q_1$ and $Q_2$ are two programs of the form (*),
then $Q_1$; $Q_2$ and while $X \neq \emptyset$ do $Q_1$ od are also equivalent to a program of the
form (*), provided $X$ becomes empty in polynomially many iterations. We call
an RQL program that only contains while-loops that iterate on any structure
only a polynomial number of times a PTIME RQL program.

The key case is when P is for |X| do P' od, with P' already of the form 
(*) by induction. Let k be the maximum number of variables that are needed to 
express the assignments of P in FO, and let v be the number of relation variables 
occurring in P. 

To simulate $P$, we first compute the $2k$-variable Abiteboul-Vianu invariant
$\pi_{2k}(A)$ [AV95] of the input structure $A$ by a PTIME RQL program. This provides
us with an ordering of the FO$^{2k}$-equivalence classes. Using this ordering, an
$\ell$-ary relation over the invariant can now encode any number between 0 and
$2^{|\pi_{2k}(A)|^{\ell}} - 1$. If $|X| < 2^{v \cdot |\pi_{2k}(A)|}$, then using relations as counters, we can
simulate the for-loop of $P$ by a while-loop of the desired form. If, on the other
hand, $|X| \geq 2^{v \cdot |\pi_{2k}(A)|}$, then we know that the execution of the loop gets into
a cycle, because there are only $2^{v \cdot |\pi_{2k}(A)|}$ assignments of values to $v$ relation
variables. Let $t$ be the number of iterations needed before the values of the
relations of $P'$ get into the cycle and let $p$ be the cycle size. Then we can simulate
$P$ by iterating the body first $t$ times and then $(|X| - t) \bmod p$ times. The former
can be done as in the previous case (the value of $t$ is less than $2^{v \cdot |\pi_{2k}(A)|}$). For
the latter we make use of the equality $|X| = \sum_{[a] \subseteq X} |[a]|$, where $[a]$ is the
FO$^{2k}$-equivalence class of $a$.
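The arithmetic behind this step can be checked on any deterministic iteration over a finite state space. The sketch below is an illustration in Python, not the RQL construction: `f` stands in for one execution of the loop body $P'$, states are assumed to be hashable values, and all function names are hypothetical.

```python
def iterate(f, x, n):
    """Apply f to x exactly n times."""
    for _ in range(n):
        x = f(x)
    return x


def find_threshold_and_period(f, x0):
    """Return (t, p): the state sequence enters a cycle of size p after t steps."""
    first_seen = {}
    x, i = x0, 0
    while x not in first_seen:
        first_seen[x] = i
        x = f(x)
        i += 1
    t = first_seen[x]
    return t, i - t


def iterate_fast(f, x0, n):
    """Same result as iterate(f, x0, n), using t + ((n - t) mod p) steps when n >= t."""
    t, p = find_threshold_and_period(f, x0)
    if n < t:
        return iterate(f, x0, n)
    return iterate(f, iterate(f, x0, t), (n - t) % p)
```

Once the iteration has entered its cycle, only the residue of the remaining step count modulo $p$ matters, which is why the simulation needs to count $|X|$ only modulo $p$.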

The program iterating $P'$ exactly $(|X| - t) \bmod p$ times is shown in Figure 3.
Note that this is the only place where a real for-loop appears in the simulation. 



X' := X; counter := 0;
while X' ≠ ∅ do
    [a] := least class in X' according to the ordering of π_2k(A);
    X' := X' − [a];
    for |[a]| do counter := (counter + 1) mod p od;
od;
while counter ≠ 0 do
    P'; counter := counter − 1;
od;
P_rest



Fig. 3. The program iterating $P'$ exactly $(|X| - t) \bmod p$ times.



Here, $P_{rest}$ is the PTIME RQL program that iterates $P'$ precisely $p - (t \bmod p)$
times. We omit the description of the programs that compute the threshold $t$
and the cycle size $p$. It can be shown that they are equivalent to a program of
the form (*).

It remains to be noticed that the test whether $|X| < 2^{v \cdot |\pi_{2k}(A)|}$ can be
simulated by a program of the form (*) in the standard way. Now, using the
transformations of while-loops specified above one can reduce all while-loops to one.
This ends the induction step. □

4.2 Nesting in FO(FOR) 

Nesting in the head. Let FH-FO(FOR), FO(FOR) with first-order heads, be 
the fragment of FO(FOR) that does not allow for-loops in its head formulas. 
Consider the sentence 

^=-(3r;)[FOR#^’^(")-F(^;)](r;) 
where ijj{x) = 

It expresses the query "Are there an even number of nodes with even outdegree?"
We can simulate the for-loop in the head by introducing an extra for-loop that 
computes in its first iteration the relation defined by the head of ^ and computes 
in its second iteration the body of ^ using the relation it computed in its first 
iteration. This construction can be generalized to arbitrary FO(FOR) formulas. 
Hence, nesting in the head is dispensable: 

Proposition 10. FO(FOR) is equivalent to FH-FO(FOR). 
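To make the example query concrete, here is a plain Python rendering of "are there an even number of nodes with even outdegree?". Parity is computed by toggling a flag once per counted element, loosely mirroring how a for-loop over a cardinality can compute parity; the function names and the graph encoding (a node collection plus a set of directed edge pairs) are assumptions of this sketch, not FO(FOR) syntax.

```python
def is_even_by_toggling(count):
    """Decide whether count is even by flipping a flag count times."""
    even = True
    for _ in range(count):
        even = not even
    return even


def even_number_of_even_outdegree_nodes(nodes, edges):
    """nodes: iterable of vertices; edges: set of directed pairs (u, v)."""
    nodes = list(nodes)
    outdegree = {v: 0 for v in nodes}
    for u, _ in edges:
        outdegree[u] += 1
    # Inner parity test: which nodes have even outdegree?
    even_nodes = [v for v in nodes if is_even_by_toggling(outdegree[v])]
    # Outer parity test: is the number of such nodes even?
    return is_even_by_toggling(len(even_nodes))
```

The inner parity test plays the role of the for-loop in the head, and the outer one the role of the enclosing loop.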

The above described construction, however, introduces relational parameters 
in head formulas. We next show that in general one cannot get rid of these 
parameters. Let PFH-FO(FOR), FO(FOR) with pure first-order heads, be the 
fragment of FH-FO(FOR) that forbids relational variables in head formulas of 
for-loops. 

To prove inexpressibility results for PFH-FO(FOR), we introduce an extended
version of the $k$-pebble game IFF95I. First, the Duplicator has to preserve
partial isomorphisms between the pebbles as in the ordinary $k$-pebble game. But
on top of that, he must also make sure that for any FO$^k$ formula $\varphi(\bar x, \bar y)$, if we
fill in some pebbled elements $\bar a$ from the first structure $A$ and take the
corresponding pebbled elements $\bar b$ in the second structure $B$ (or vice versa) then
$|\{\bar a' \mid A \models \varphi[\bar a, \bar a']\}| = |\{\bar b' \mid B \models \varphi[\bar b, \bar b']\}|$. This game provides us with the
following tool:

Lemma 11. Let $Q$ be a Boolean query. If for every $k$, there exist structures $A_k$
and $B_k$ such that $Q(A_k) \neq Q(B_k)$, and the Duplicator has a winning strategy
in the extended $k$-pebble game on $A_k$ and $B_k$, then $Q$ is not definable in
PFH-FO(FOR).

Proof (sketch). The key observation is that any PFH-FO(FOR) formula is
equivalent to a formula of infinitary logic with counting, where the counting
quantifiers $\exists^{k}\bar x$ (meaning there are exactly $k$ tuples $\bar x$ such that ...) are applied to
FO formulas only. It is routine to verify that the lemma holds for this logic. □






Frank Neven et al. 



Consider the following Boolean query over the vocabulary of graphs: $Q_4(\mathcal{G})$
is true iff

(i) every connected component of $\mathcal{G}$ is ordered by the HKL query (cf. the
paragraph before Section mi);

(ii) the number of elements in each connected component is larger than the 
number of connected components; and 

(iii) there are exactly two isomorphism types of connected components and they 
appear in equal numbers of copies. 

This query can be shown to be definable in FO(FOR). Using Lemma 11,
however, we can show that $Q_4$ is not definable in PFH-FO(FOR), as follows.

By a counting argument one can show that for any $k$, there are two
non-isomorphic connected graphs $G_k$ and $H_k$ over which every FO$^k$ formula is
equivalent to a quantifier free one, and moreover for every such FO$^k$ formula $\varphi$ it holds
that $|\{\bar g \mid G_k \models \varphi[\bar g]\}| = |\{\bar g \mid H_k \models \varphi[\bar g]\}|$. Additionally, $G_k$ and $H_k$ can be
ordered by the HKL query and $|H_k|, |G_k| > 2k + 2$.

Let $A_k$ be the disjoint union of $k+1$ copies of $G_k$ and $k+1$ copies of $H_k$, and
let $B_k$ be the disjoint union of $k$ copies of $G_k$ and $k+2$ copies of $H_k$. The
Duplicator has a winning strategy for the extended $k$-pebble game on these structures: it
suffices to ensure that the partial isomorphism required for the game can always
be extended to a partial isomorphism defined for all the members of the
connected components in which the pebbles are located. Because initially there are
no pebbles on the board and there are enough copies of the graphs $H_k$ and $G_k$
in both structures, the Duplicator can maintain this strategy. It can be shown
that under this strategy the conditions of the extended $k$-pebble game are always
satisfied. The key observation is that the cardinalities of definable relations are
functions of the cardinalities of certain quantifier free definable relations over the
components. In the absence of pebbles these cardinalities are equal in
components isomorphic to $G_k$ and $H_k$, and the components which contain parameters
are isomorphic.

We obtain: 

Theorem 12. PFH-FO(FOR) is strictly weaker than FO(FOR). 

Nesting in the body. Let FB-FO(FOR), FO(FOR) with first-order bodies, be the 
fragment of FO(FOR) that does not allow for-loops in body formulas. 

We next show that FO(FOR) is strictly more expressive than FB-FO(FOR). 
Let $n * \mathcal{G}$ denote the disjoint union of $n$ copies of graph $\mathcal{G}$.

Let $Q_5$ be the following query over the vocabulary of graphs: $Q_5(\mathcal{G})$ is true
iff

(i) $\mathcal{G} = n * \mathcal{H}$;

(ii) $\mathcal{H}$ can be ordered by the HKL query;

(iii) $n < |\mathcal{H}|$.

This query can be shown to be definable in FO(FOR). $Q_5$ is not definable in
FB-FO(FOR). The key observation is that if $\mathcal{H}$ satisfies extension axioms for
sufficiently many variables (at least as many as there are in the sentence), then
the unnested bodies of for-loops start to cycle within a bounded number of
iterations already, while $Q_5$ requires counting up to $|\mathcal{H}|$.

Consequently, no FB-FO(FOR) sentence can define $Q_5$. We obtain:

Theorem 13. FB-FO(FOR) is strictly weaker than FO(FOR).

Abstract. We study the expressive power of various query languages on
relational databases of bounded tree-width.

Our first theorem says that fixed-point logic with counting captures polynomial
time on classes of databases of bounded tree-width. This result should be seen
against the background of an important open question of Chandra and Harel
asking whether there is a query language capturing polynomial time on unordered
databases. Our theorem is a further step in a larger project of extending the scope
of databases on which polynomial time can be captured by reasonable query
languages.

We then prove a general definability theorem stating that each query on a class
of databases of bounded tree-width which is definable in monadic second-order
logic is also definable in fixed-point logic (or datalog). Furthermore, for each
$k \geq 1$ the class of databases of tree-width at most $k$ is definable in fixed-point logic.
These results have some remarkable consequences concerning the definability of
certain classes of graphs.

Finally, we show that each database of tree-width at most $k$ can be characterized
up to isomorphism in the language $C^{k+3}$, the $(k+3)$-variable fragment of
first-order logic with counting.



1 Introduction 

The tree-width of a graph, which measures the similarity of the graph to a tree, has 
turned out to be an indispensable tool when trying to find feasible instances of hard 
algorithmic problems. As a matter of fact, many NP-complete problems can be solved 
in linear time on classes of graphs of bounded tree-width (see Q for a survey). Nu- 
merous examples can easily be obtained by a result of Courcelle 0 saying that any 
query definable in monadic second-order logic (MSO) can be evaluated in linear time 
on graphs of bounded tree-width. Courcelle’s MSO allows quantification over sets of
edges and sets of vertices; this makes it quite powerful. For instance, hamiltonicity of a
graph is defined by an MSO-sentence saying that “there exists a set X of edges such that
each vertex is incident to exactly two edges in X, and the subgraph spanned by the
edge-set X is connected”. The importance of these results derives from the fact that many
classes of graphs occurring in practice are of small tree-width.

The notion of tree-width of a graph was introduced (under a different name) by
Halin Ca and later re-invented by Robertson and Seymour [20]. It has a straightforward
generalization to relational databases (due to Feder and Vardi m, also see ifTO l). It
is easy to see that most of the results on graphs of bounded tree-width generalize to
relational databases. Among them is Courcelle’s theorem. It immediately implies that
on a class of databases of bounded tree-width each first-order query can be evaluated in
linear time.$^1$

Our main task is to study the expressive power of fixed-point logic with and without 
counting on databases of bounded tree-width. Of course, our results transfer to the query
languages datalog and datalog with counting, both under the inflationary semantics, 
which are known to have the same expressive power as the corresponding fixed-point 
logics ( 11111211 1. 

Theorem 1. Let $k \geq 1$. Fixed-point logic with counting captures polynomial time on
the class of all databases of tree-width at most $k$.

In other words, a “generic” query on a class of databases of bounded tree-width
is computable in polynomial time if, and only if, it is definable in fixed-point logic
with counting. It is important to note here that we are considering classes of unordered
databases; on ordered databases the result is just a special case of the well-known
result of Immerman and Vardi saying that fixed-point logic captures polynomial
time on ordered databases. The problem of capturing polynomial time on unordered
databases goes back to Chandra and Harel Q. Our result should be seen as part of a
larger project of gradually extending the scope of classes of databases on which we
can capture polynomial time by “nice” logics. The main motivation for this is to get a
better understanding of “generic polynomial time”, which may contribute to the design
of better query languages. Though it is known that fixed-point logic with counting does
not capture polynomial time on all databases 0|, it has turned out to be the right logic
in several special cases. The best previously known result is that it captures polynomial
time on the class of planar graphs [14]. Our Theorem 1 is somewhat complementary to
this, by a result of Robertson and Seymour [22] saying that a class $C$ of graphs is of
bounded tree-width if, and only if, there is a planar graph that is not a minor (see below)
of any graph in $C$.

We next turn to fixed-point logic without counting. Although this weaker logic does 
certainly not capture polynomial time on databases of bounded tree-width, it remains 
surprisingly expressive. 

Theorem 2. Let $k \geq 1$ and $\tau$ a database schema.

(1) The class of all databases over $\tau$ of tree-width at most $k$ is definable in fixed-point
logic.

(2) Each MSO-definable query on the class of databases of tree-width at most $k$ is
definable in fixed-point logic. More precisely, for each MSO-definable class $\mathcal{D}$ of
databases the class

$\{D \in \mathcal{D} \mid D \text{ is of tree-width at most } k\}$

is definable in fixed-point logic.$^2$

$^1$ More precisely, for each first-order formula $\varphi(\bar x)$ there is a linear-time algorithm that,
given a database $D$ and a tuple $\bar a$, checks whether $D \models \varphi(\bar a)$.

$^2$ For convenience we have only formulated the result for Boolean queries, but the
generalization to arbitrary queries is straightforward.

This theorem has some nice applications to graphs. To obtain the full strength of
Courcelle’s MSO with quantification over edge-sets, we consider graphs as databases
over the schema $\{V, E, I\}$ with unary $V$, $E$ and a binary $I$. A graph is then a triple
$G = (V^G, E^G, I^G)$, where $V^G$ is the vertex-set, $E^G$ is the edge-set, and
$I^G \subseteq V^G \times E^G$ the binary incidence relation between them.$^3$ Thus Theorem 2
implies, for example, that hamiltonicity is fixed-point definable on graphs of bounded
tree-width.

Recall that a graph $H$ is a minor of a graph $G$ if it is obtained from a subgraph of $G$
by contracting edges. It can easily be seen that for each fixed graph $H$ there is an
MSO-sentence saying that a graph $G$ contains $H$ as a minor. In their Graph Minor Theorem,
Robertson and Seymour o proved that for each class $C$ of graphs closed under taking
minors there are finitely many graphs $H_1, \ldots, H_n$ such that a graph $G$ is in $C$ if, and
only if, it contains none of $H_1, \ldots, H_n$ as a minor. (Actually, a restricted version of
this theorem, proved in Q, suffices for our purposes.) Together with the fact that the
class of all planar graphs is fixed-point definable im and the abovementioned result of
Robertson and Seymour that a class $C$ of graphs is of bounded tree-width if, and only if,
there is a planar graph that is not a minor of any graph in $C$, we obtain a quite surprising
corollary:

Corollary 1. Let C be a class of planar graphs that is closed under taking minors. Then 
C is definable in fixed-point logic. 

As another by-product of our main results, we obtain a theorem that continues a
study initiated by Immerman and Lander ifTTm. $C^k$ denotes the $k$-variable (first-order)
logic with counting quantifiers.

Theorem 3. Let $k \geq 1$. For each database $D$ of tree-width at most $k$ there is a
$C^{k+3}$-sentence that characterizes $D$ up to isomorphism.

On the other hand, Cai, Fürer, and Immerman m proved that for each $k \geq 1$ there
are non-isomorphic graphs $G, H$ of size $O(k)$ that cannot be distinguished by a
$C^k$-sentence.

2 Preliminaries 

A database schema is a finite set $\tau$ of relation symbols with associated arities. We
fix a countable domain dom. A database instance, or just database, over the schema
$\tau = \{R_1, \ldots, R_n\}$ is a tuple $D = (R_1^D, \ldots, R_n^D)$, where $R_i^D$ is a finite relation on
dom of the arity associated with $R_i$. The active domain of $D$, denoted by $A^D$, is the set
of all elements of dom occurring in a tuple contained in any of the $R_i^D$. For convenience,
we usually write $a \in D$ instead of $a \in A^D$; we even write $\bar a \in D$ instead of
$\bar a \in (A^D)^k$ for $k$-tuples $\bar a$. By $\bar a$ we always denote a tuple $a_1 \ldots a_k$, for some
$k \geq 1$. The size of a database $D$, denoted by $|D|$, is the size of its active domain. In
general, $|S|$ denotes the size of a set $S$.

$^3$ It is more common in database theory to encode graphs as databases over $\{V, R\}$,
where $V$ is unary and $R$ binary. But it is easy to see that there is a first-order definable
transformation between graphs $(V, R)$ and the corresponding $(V, E, I)$ and vice-versa,
and it changes the tree-width by at most 1.

If $D$ is a database over $\tau$, $X$ a $k$-ary relation symbol not contained in $\tau$, and $X^*$ a
$k$-ary relation over dom, then $(D, X^*)$ denotes the database over $\tau \cup \{X\}$ defined in
the obvious way. Occasionally we also need to expand a database $D$ by distinguished
elements; for a tuple $\bar a \in$ dom we write $(D, \bar a)$ for the expansion of $D$ by $\bar a$. An
isomorphism between $(D, a_1 \ldots a_l)$ and $(E, b_1 \ldots b_l)$ is a bijective
$f : A^D \cup \{a_1, \ldots, a_l\} \to A^E \cup \{b_1, \ldots, b_l\}$ such that $f(a_i) = b_i$ for $i \leq l$ and
$f$ is an isomorphism between $D$ and $E$.

A subinstance of a database $D$ over $\tau$ is a database $E$ over $\tau$ such that for all $R \in \tau$
we have $R^E \subseteq R^D$. For $B \subseteq$ dom we let $\langle B \rangle^D$ denote the subinstance $E$ of $D$ where
for all $R \in \tau$, say, of arity $r$, we have $R^E = R^D \cap B^r$. We often omit the superscript
$D$ and just write $\langle B \rangle$ if $D$ is clear from the context.

The union, intersection, difference of a family of databases over the same schema
is defined relation-wise. For example, the union of two databases $D$ and $E$ over $\tau$ is the
database $D \cup E$ over $\tau$ with $R^{D \cup E} = R^D \cup R^E$ for all $R \in \tau$. If $D$ is a database and
$B \subseteq$ dom we let $D \setminus B = \langle A^D \setminus B \rangle^D$. Furthermore, for an $l$-tuple $\bar a \in$ dom we let
$D \setminus \bar a = D \setminus \{a_1, \ldots, a_l\}$.
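These set-theoretic operations are easy to try out on small instances. The sketch below assumes a database is represented as a Python dict mapping relation names to sets of tuples; this representation and the function names are choices made for illustration only.

```python
def active_domain(db):
    """All elements occurring in some tuple of some relation."""
    return {x for rel in db.values() for tup in rel for x in tup}


def induced(db, b):
    """Subinstance induced by b: keep only tuples whose entries all lie in b."""
    b = set(b)
    return {name: {t for t in rel if all(x in b for x in t)}
            for name, rel in db.items()}


def remove(db, b):
    """The instance obtained by deleting the elements of b."""
    return induced(db, active_domain(db) - set(b))


def union(d, e):
    """Relation-wise union of two databases over the same schema."""
    return {name: d[name] | e[name] for name in d}
```

For example, with `db = {"V": {(1,), (2,), (3,)}, "E": {(1, 2), (2, 3)}}`, removing element 2 leaves the two isolated vertices 1 and 3 and no edges.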

An ordered database is a database $D$ over a schema that contains the binary relation
symbol $<$ such that $<^D$ is a linear order of the active domain of $D$.

We assume that the reader is familiar with the notions of a query and query
language. Most of the time, we restrict our attention to Boolean queries,$^4$ which we just
consider as isomorphism closed classes of databases over the same schema. A
Boolean query on a class $\mathcal{D}$ of databases is a Boolean query $Q$ which is a subclass
of $\mathcal{D}$.

We say that a query language $L$ captures a complexity class $K$ if for each query $Q$
we have: $Q$ is definable in $L$ if, and only if, it is computable in $K$. We say that $L$ captures
$K$ on a class $\mathcal{D}$ of databases if for each query $Q$ on $\mathcal{D}$ we have: $Q$ is definable in $L$ if,
and only if, it is computable in $K$.

For example, a well-known result of Immerman and Vardi 1 1 bl24l says that least 
fixed-point logic captures polynomial time on the class of all ordered databases. 



2.1 Fixed-Point Logics 

We assume a certain familiarity with first-order logic FO. The query languages we are 
mainly interested in are inflationary fixed-point logic IFP and inflationary fixed-point 
logic with counting IFP+C. 

The set of IFP-formulas is obtained by adding the following formula-formation rule to
the usual rules to form first-order formulas:

Given a formula $\varphi$ over the schema $\tau \cup \{X\}$, where $X$ is a $k$-ary relation
symbol for some $k \geq 1$, and two $k$-tuples $\bar x, \bar z$ of variables, we may form the
new formula $[\mathrm{IFP}_{\bar x, X}\varphi]\bar z$ over $\tau$.

$^4$ This restriction is inessential and can easily be removed. But it simplifies things a little bit.

The semantics of IFP is, as usual, given by a satisfaction relation $\models$. In particular,
consider an IFP-formula $\varphi(\bar x, \bar y)$ over a schema $\tau \cup \{X\}$, such that the free variables
of $\varphi$ occur in the tuples $\bar x, \bar y$, and a database $D$ over $\tau$. For each tuple $\bar b \in D$ we let
$X_{\bar b}^0 = \emptyset$ and $X_{\bar b}^{i+1} = X_{\bar b}^i \cup \{\bar a \in D \mid (D, X_{\bar b}^i) \models \varphi(\bar a, \bar b)\}$ (for $i \geq 0$).
Furthermore, we let $X_{\bar b}^{\infty} = \bigcup_{i \geq 1} X_{\bar b}^i$. Then $D \models [\mathrm{IFP}_{\bar x, X}\varphi(\bar x, \bar b)]\bar c$ if, and
only if, $\bar c \in X_{\bar b}^{\infty}$ (for all $\bar c \in D$).
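The stage construction can be imitated directly in Python. In this sketch the formula $\varphi$ is replaced by a hypothetical function `step` that, given the current stage $X$, returns the set of tuples satisfying $\varphi$ relative to $X$; transitive closure serves as the standard example. This is an illustration of the semantics only, not an IFP evaluator.

```python
def inflationary_fixed_point(step):
    """Iterate X_{i+1} = X_i ∪ step(X_i) from the empty set until it stabilizes."""
    x = set()
    while True:
        nxt = x | step(x)
        if nxt == x:
            return x
        x = nxt


def transitive_closure(edges):
    """Fixed point of  phi(x, y) = E(x, y) or exists z (X(x, z) and E(z, y))."""
    edges = set(edges)

    def step(x):
        return edges | {(a, c) for (a, b) in x for (b2, c) in edges if b == b2}

    return inflationary_fixed_point(step)
```

On a finite database the stages form an increasing chain of subsets of a finite set, so the loop terminates after at most polynomially many rounds in the size of the active domain.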

To define IFP+C we need to introduce a second countable domain num disjoint
from dom. We think of num as a copy of the non-negative integers, and we usually
do not distinguish between an element of num and the corresponding integer. We also
need variables of sort num which we denote by $\mu, \nu$. The symbols $x, y, z$ always refer to
variables of sort dom. If we do not want to specify the sort of a variable, we use symbols
$u, v, w$. Furthermore, we let $\leq$ denote a binary relation symbol which is supposed to
range over num.

For each database $D$ over $\tau$ we let $N^D$ be the initial segment $\{0, \ldots, |D|\}$ of num
of length $|D| + 1$, and we let $\leq^D$ be the natural order on $N^D$. We let $D^* = (D, \leq^D)$,
considered as a database of schema $\tau^* = \tau \cup \{\leq\}$ on the domain dom $\cup$ num. Note that
the active domain of $D^*$ is $A^D \cup N^D$. We can now consider IFP on databases with this
extended domain.

The set of IFP+C-formulas is defined by the same rules as the set of IFP-formulas
and the following additional rule:

If $\varphi$ is a formula and $x, \mu$ are variables then $\exists^{=\mu}x\,\varphi$ is a new formula. (Recall
that by our convention $x$ ranges over dom and $\mu$ over num.)

To define the semantics, let $\varphi(x, \bar w)$ be a formula over $\tau^*$ whose free variables all
occur in $x, \bar w$, let $D$ be a database over $\tau$, and $\bar d \in A^D \cup N^D$, $i \in N^D$. Then
$D^* \models \exists^{=i}x\,\varphi(x, \bar d)$ if, and only if, the number of $a \in A^D$ such that
$D^* \models \varphi(a, \bar d)$ is $i$.

So far, our IFP+C-formulas only speak about databases $D^*$. However, for formulas
$\varphi$ without free num-variables, we can let $D \models \varphi \iff D^* \models \varphi$. This way,
IFP+C-formulas also define queries in our usual framework of databases on the domain dom.

IFP+C has turned out to capture polynomial time on several classes of databases. It 
follows easily from the Immerman-Vardi-Theorem mentioned earlier that IFP+C cap- 
tures polynomial time on the class of all ordered databases. The following Lemma, 
which is based on the notion of definable canonization introduced in JOd, can be 
used to extend this result to further classes of databases. The straightforward proof can 
be found in o. 

For an IFP+C-formula $\varphi(\bar w)$ and a database $D$ we let
$\varphi(\bar w)^{D^*} = \{\bar d \in D^* \mid D^* \models \varphi(\bar d)\}$. Note, in particular, that if $\bar w$ is an
$l$-tuple of num-variables, then $\varphi(\bar w)^{D^*}$ is an $l$-ary relation on num.

Lemma 1. Let $\tau = \{R_1, \ldots, R_n\}$ be a database schema, where $R_i$ is $r_i$-ary, and let
$\mathcal{D}$ be a class of databases over $\tau$. Suppose that there are IFP+C-formulas
$\varphi_1(\bar\mu_1), \ldots, \varphi_n(\bar\mu_n)$, where $\bar\mu_i$ is an $r_i$-tuple of num-variables, such that for all
$D \in \mathcal{D}$ the database

$(\varphi_1(\bar\mu_1)^{D^*}, \ldots, \varphi_n(\bar\mu_n)^{D^*})$

(which is a database over $\tau$ on the domain num) is isomorphic to $D$.

Then IFP+C captures polynomial time on $\mathcal{D}$.

The reason that this holds is that we can speak of the order on $N^D$ in IFP+C.
Thus essentially the hypothesis of the lemma says that we can define ordered copies of
all databases $D \in \mathcal{D}$ in IFP+C.



3 Tree Decompositions 

Deviating from the introduction, from now on we find it convenient to consider graphs 
as databases over {V,E}, where V is unary and E is binary. Without further explana- 
tion, we use usual graph theoretic notions such as paths, cycles, etc. A tree is a con- 
nected, cycle-free graph. 

Definition 1. A tree-decomposition of a database $D$ is a pair $(T, (B_t)_{t \in T})$, where $T$ is
a tree and $(B_t)_{t \in T}$ a family of subsets of $A^D$ such that $\bigcup_{t \in T} \langle B_t \rangle^D = D$ and for each
$a \in D$ the subgraph $\langle \{t \mid a \in B_t\} \rangle^T$ of $T$ is connected.

The $B_t$ are called the blocks of the decomposition.

The width of $(T, (B_t)_{t \in T})$ is $\max\{|B_t| \mid t \in T\} - 1$. The tree-width of $D$, denoted
by tw($D$), is the minimal width of a tree-decomposition of $D$.
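A checker for this definition is short enough to write out. The sketch below assumes a database is a dict from relation names to sets of tuples, the tree is given by a node list and an edge list, and `blocks` maps each tree node to a set of elements; it reads the first condition of the definition as "every tuple of every relation lies inside some block". All names are illustrative.

```python
def _connected(nodes, edges):
    """Is the subgraph induced by `nodes` connected? (The empty set counts as connected.)"""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for e in edges:
        if len(e) == 2 and set(e) <= nodes:
            u, v = tuple(e)
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return seen == nodes


def is_tree_decomposition(db, tree_nodes, tree_edges, blocks):
    tree_nodes = set(tree_nodes)
    edges = {frozenset(e) for e in tree_edges}
    # T must be a tree: non-empty, connected, with |T| - 1 edges.
    if not tree_nodes or len(edges) != len(tree_nodes) - 1:
        return False
    if not _connected(tree_nodes, edges):
        return False
    domain = {x for rel in db.values() for tup in rel for x in tup}
    if any(not set(blocks[t]) <= domain for t in tree_nodes):
        return False
    # Every tuple must be contained in some block.
    for rel in db.values():
        for tup in rel:
            if not any(set(tup) <= set(blocks[t]) for t in tree_nodes):
                return False
    # For each element, the tree nodes whose blocks contain it form a subtree.
    for a in domain:
        holders = {t for t in tree_nodes if a in blocks[t]}
        if not _connected(holders, edges):
            return False
    return True


def width(blocks):
    """Width of a decomposition: largest block size minus one."""
    return max(len(b) for b in blocks.values()) - 1
```

For a path on four vertices, the blocks `{1,2}`, `{2,3}`, `{3,4}` arranged along a path form a valid decomposition of width 1, while reordering them so that the two blocks containing vertex 2 are no longer adjacent violates the connectivity condition.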

On graphs, this notion of tree-decomposition coincides with the usual notion intro- 
duced by Robertson and Seymour C(3l. 

We now present two basic and well-known lemmas that give rise to a fixed-point
definition of the class of databases of tree-width at most $k$. This requires some
additional notation. We recommend that the reader become familiar
with this notation and the lemmas, since they will be used again and again throughout
the whole paper. For other basic facts about tree-decompositions and tree-width we 
refer the reader to Similar techniques as we use them here have been employed by 
Bodlander in OBI 

Definition 2. Let $k \geq 1$ and $D$ a database. A $k$-preclique of $D$ is a $(k+1)$-tuple $\bar a \in D$
of distinct elements such that $D$ has a tree-decomposition $(T, (B_t)_{t \in T})$ of width at most
$k$ with $B_t = \{a_1, \ldots, a_{k+1}\}$ for some $t \in T$.

Lemma 2. Let $k \geq 1$ and $D$ a database of tree-width at most $k$ and size at least $(k+1)$.

(1) $D$ has a tree-decomposition $(T, (B_t)_{t \in T})$ with $|B_t| = k+1$ for all $t \in T$ and
$B_t \neq B_u$ for all distinct $t, u \in T$.

In other words, $D$ has a tree-decomposition whose blocks are pairwise distinct
$k$-precliques. We call such a tree-decomposition a regular tree-decomposition.

(2) For each $k$-preclique $\bar a$ of $D$ there is a regular tree-decomposition $(T, (B_t)_{t \in T})$ of
$D$ such that there is a $t \in T$ with $B_t = \{a_1, \ldots, a_{k+1}\}$.

Proof. To prove (1), let $(T, (B_t)_{t \in T})$ be a tree-decomposition of $D$ of width $k$ such that
among all tree-decompositions of $D$ of width $k$:

(i) $|T|$ is minimal.

(ii) Subject to (i), $\sum_{t \in T} |B_t|$ is maximal.

(i) clearly implies that there are no adjacent $t, u \in T$ such that $B_t \subseteq B_u$, because
otherwise we could simply contract the edge $tu$ in $T$ and obtain a smaller tree. Now
suppose $|B_t| < k + 1$ for some $t \in T$. Let $u$ be a neighbor of $t$ and $a \in B_u \setminus B_t$. We
can simply add $a$ to $B_t$ and obtain a new tree-decomposition with a larger sum in (ii).
This contradicts our choice of $(T, (B_t)_{t \in T})$.

(2) can be proved similarly. □

With each database $D$ over $\tau$ we associate its Gaifman graph $G(D)$ with vertex set
$V^{G(D)} = A^D$ and an edge between two vertices $a, b$ if there is a tuple $\bar c$ in one of the
relations of $D$ such that both $a$ and $b$ occur in $\bar c$. Now we can transfer graph theoretic
notions such as connected components or distances to arbitrary databases; we just refer
to the respective notions in the Gaifman graph.

By comp($D$) we denote the set of active domains of the connected components
of a database $D$. Hence the connected components of $D$ are the subinstances $\langle C \rangle$ for
$C \in$ comp($D$).
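The Gaifman graph and the set comp($D$) can be computed with a few lines of Python. As before, the dict-of-relations representation and the function names are assumptions of this sketch.

```python
from itertools import combinations


def gaifman_edges(db):
    """Edges {a, b} for distinct a, b occurring together in some tuple."""
    return {frozenset((a, b))
            for rel in db.values()
            for tup in rel
            for a, b in combinations(set(tup), 2)}


def comp(db):
    """Active domains of the connected components of db, as frozensets."""
    domain = {x for rel in db.values() for tup in rel for x in tup}
    adj = {v: set() for v in domain}
    for e in gaifman_edges(db):
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    result, seen = set(), set()
    for v in domain:
        if v in seen:
            continue
        block, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in adj[u] - block:
                block.add(w)
                stack.append(w)
        seen |= block
        result.add(frozenset(block))
    return result
```

A ternary tuple such as `(1, 2, 3)` contributes a triangle to the Gaifman graph, so its three elements always end up in the same component.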

Let $l \geq 1$, $D$ a database, and $\bar a \in D$ an $l$-tuple of vertices. For $b \in D$ we let
$C_b^{\bar a}$ denote the (unique) $C \in$ comp($D \setminus \bar a$) with $b \in C$; if $b \in \{a_1, \ldots, a_l\}$ then
$C_b^{\bar a}$ is the empty set. Note that comp($D \setminus \bar a$) $= \{C_b^{\bar a} \mid b \in D\}$. Furthermore, we let
$C_b^{+\bar a} = C_b^{\bar a} \cup \{a_1, \ldots, a_l\}$.

For an $l$-tuple $\bar a$, an $i \leq l$, and any $c$ we let $\bar a/i$ denote the $(l-1)$-tuple obtained from
$\bar a$ by deleting the $i$th component and $\bar a(c/i)$ the $l$-tuple obtained from $\bar a$ by replacing the
$i$th component by $c$.

Observe that a $k$-preclique of a database $D$ either separates $D$ or is the block of a
leaf in every regular tree-decomposition of $D$ where it occurs as a block. Also note that
a database has tree-width at most $k$ if, and only if, it has size at most $k$ or contains a
$k$-preclique.

The following lemma (for graphs) is essentially due to Arnborg, Corneil, and
Proskurowski d-.

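Lemma 3. Let $k \geq 1$, $D$ a database, and $\bar a \in D$ a $(k+1)$-tuple of distinct elements.

(1) The tuple $\bar a$ is a $k$-preclique of $D$ if and only if $\bar a$ is a $k$-preclique of
$\langle C_b^{+\bar a} \rangle$ for all $b \in D$.

(2) For all $b \in D \setminus \bar a$, the tuple $\bar a$ is a $k$-preclique of $\langle C_b^{+\bar a} \rangle$ if and only if there are
$i \leq k+1$ and $c \in C_b^{\bar a}$ such that $\bar a/i$ isolates $a_i$ in $\langle C_b^{+\bar a} \rangle$ (that is, $a_i$ is not
adjacent to any $d \in C_b^{\bar a}$ in the Gaifman graph of $D$) and $\bar a(c/i)$ is a $k$-preclique
of $\langle C_b^{+\bar a/i} \rangle$ ($= \langle C_b^{+\bar a} \rangle \setminus a_i$).

(3) For all $b \in \{a_1, \ldots, a_{k+1}\}$, the tuple $\bar a$ is a $k$-preclique of $\langle C_b^{+\bar a} \rangle$.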




Figure 1.

Proof. Note first that if $(T, (B_t)_{t \in T})$ is a tree-decomposition of a database $D$ and $E$ is
a subinstance of $D$ then $(T, (A^E \cap B_t)_{t \in T})$ is a tree-decomposition of $E$ of at most the
same width.

Then the forward direction of (1) is obvious. For the backward direction we can
simply paste together decompositions of the $\langle C_b^{+\bar a} \rangle$ in which $\{a_1, \ldots, a_{k+1}\}$ forms a
block along these blocks.

For the forward direction of (2), let $(T, (B_t)_{t \in T})$ be a regular tree-decomposition
of $\langle C_b^{+\bar a} \rangle$ that contains a $t \in T$ with $B_t = \{a_1, \ldots, a_{k+1}\}$. Since $\bar a$ does not separate
$\langle C_b^{+\bar a} \rangle$, $t$ must be a leaf of $T$. Let $u$ be the neighbor of $t$ in $T$. Let $i \in \{1, \ldots, k+1\}$
such that $a_i \in B_t \setminus B_u$ and $c \in B_u \setminus B_t$. The claim follows.

For the backward direction, let $(T, (B_t)_{t \in T})$ be a tree-decomposition of $\langle C_b^{+\bar a/i} \rangle$
that contains a $t \in T$ with $B_t = \{a_1, \ldots, a_{i-1}, c, a_{i+1}, \ldots, a_{k+1}\}$. We add a new
vertex $u$ to $T$ and make it adjacent to $t$. Furthermore, we let $B_u = \{a_1, \ldots, a_{k+1}\}$. We
obtain a tree-decomposition of $\langle C_b^{+\bar a} \rangle$ of the desired form.

(3) is trivial, since $C_b^{+\bar a} = \{a_1, \ldots, a_{k+1}\}$ in this case. □

Combining (1) and (2) we obtain the following: 

Corollary 2. Let k > 1, D a database, d G D a {k+ l)-tuple of distinct elements, and 
b G D\d. 

Then d is a k-preclique in if, and only if, there is an i < (fc -|- 1) and a c G 

such that d/i isolates at in (C^“) and a{c/i) is a k-preclique in all subinstances 
(C+“(^/*)), where d G 

Note that the subinstances are smaller then (C^“). 

This gives rise to an inductive definition of the relation P consisting of all (a, h), 
where a is a (fc -b 1) -tuple and b a single element, with the property that a is a preclique 
in (Cj('“): We start by letting P consist of all (a, Oi), where a is a (fc -b l)-tuple and 
1 <i < fc-bl. Then we repeatedly add all pairs (d,b) for which there exists an i < fc-bl 
and ac G such that d/i isolates at in (C^“) and we already have {a{c/i), d) G P 
for all d G . This inductive definition can easily be formalized in IFP. Thus we

have: 

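Lemma 4. There are IFP-formulas $\varphi(\bar x, y)$ and $\psi(\bar x)$ such that for all databases $D$,
$(k+1)$-tuples $\bar a \in D$ and $b \in D$ we have

$D \models \varphi(\bar a, b) \iff \bar a$ is a $k$-preclique in $\langle C_b^{+\bar a} \rangle$,

$D \models \psi(\bar a) \iff \bar a$ is a $k$-preclique.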

As a corollary we obtain the first part of Theorem|2 

Corollary 3. Let $\tau$ be a database schema and $k \ge 1$. Then the class of all databases
over $\tau$ of tree-width at most $k$ is definable in IFP.

4 Canonizing Databases of Bounded Tree-Width

In this section we sketch a proof of Theorem[IJ The basic idea of the proof is the same 
as used in the proof that IFP+C captures polynomial time on the class of trees given in 

m. 

Let us fix a $k \ge 1$. Without loss of generality we restrict our attention to databases
over the schema $\{R\}$, where $R$ is a binary relation symbol.

We want to apply LemmaOIto the class T) of databases over {i?} of tree-width at 
most k. We shall define a formula i^) such that for all databases D G T> we have 
D ^ <f{n, v)^* . 

For the course of our presentation, let us fix a database $D \in \mathcal{D}$; of course the formula
$\varphi$ we are going to obtain does not depend on $D$ but works uniformly over $\mathcal{D}$.

Inductively we are going to define a (fc -f 4)-ary relation X C x 

with the following property; 

For all a,b G D such d is a A: -preclique in we have: 

(i) There is an isomorphism / between ((C^“), d) and {Xab—, 1 . . .k + 1), where 
Xab__ = {ij G (iV^)^ | abij G X}. 

(Recall that ((C^“), d) denotes the expansion of the database by the distin- 

guished elements d.) 

(ii) For all b' G C^“ we have Xab._ = Xab' 

We start our induction by letting $X$ be the set of all tuples $\bar a b i j$ where $\bar a \in D$ is a
$(k+1)$-tuple of distinct elements, $b \in \{a_1, \dots, a_{k+1}\}$, and $i, j \in \{1, \dots, k+1\}$ are
chosen such that $a_i a_j \in R^D$.

For the induction step, recall CorollaryQ We consider d, c G ZJ, z G {1, . . . , fc-f 1} 
such that 

- a ji isolates in 

- for all d G Cc d(c/z) is a fc-precliqueof and the relation Xd(c/z)d__ 

has already been defined, 

- Xdc__ has not been defined yet. 

Note that these conditions imply, by Corollary 0 that d is a fc-preclique in (C+“). 

Let a! = aicji). For the d G we remember that Xd'd— is a database over 

{R} whose active domain is contained in N^. On we have the order available. 
Hence we can view these two relations together as an ordered database over the schema 
{i?, Ordered databases, in turn, can be ordered lexicographically. 

Let Cl, . . . , Cl he a list of the components (Cj"® ), for d G Cc ■ For 1 < z < Z 

and d G Ci, let Xi = Xdd By (ii), Xi does not depend on the choice of d. Some of 

the Xi may appear more than once, because different components may be isomorphic. 
We produce another list $Y_1, \dots, Y_m$ (where $m \le l$) that is ordered lexicographically
(with respect to the order explained in the previous paragraph) and that only contains
each entry once, and we let $n_i$ be the number of times $Y_i$ occurs in the list $X_1, \dots, X_l$.
(The list $Y_1, \dots, Y_m$ can actually be defined in IFP+C; counting is crucially needed to
obtain the multiplicities $n_i$.)
Let $C$ denote the subinstance of $D$ obtained by pasting the components $C_1, \dots, C_l$
together along the tuple a'. Using the Yi and rii we define a binary relation Y on 
such that (C, o') is isomorphic to (Y, 1 ... fc + 1). It is clearly possible to do the arith- 
metic necessary here within the formal framework of our logic IFP+C. 
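The bookkeeping behind the lists $X_1, \dots, X_l$ and $Y_1, \dots, Y_m$ can be illustrated outside the logic. The following is only a sketch under assumptions: each ordered copy is represented as a set of integer pairs, and the lexicographic order is taken on the sorted list of pairs, which need not coincide with the order used in the paper. The function names are hypothetical.

```python
from collections import Counter


def canonical_key(relation):
    """Sort the pairs of a binary relation over the integers.

    The sorted tuple serves as a key that can be compared lexicographically.
    """
    return tuple(sorted(relation))


def distinct_with_multiplicities(xs):
    """Return the distinct relations in lexicographic order and their counts."""
    counts = Counter(canonical_key(x) for x in xs)
    ys = sorted(counts)
    return ys, [counts[y] for y in ys]


# Three components, two of which have the same ordered copy.
xs = [{(1, 2), (2, 3)}, {(1, 2)}, {(2, 3), (1, 2)}]
ys, ns = distinct_with_multiplicities(xs)
```

Here `ys` plays the role of the duplicate-free list and `ns` the role of the multiplicities; in IFP+C the second list is where counting is needed.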

In a second step we add an element representing in an appropriate position and 
rearrange Y to obtain an X'ac— that satisfies (i). However, (ii) is not guaranteed so far 
because our definition may depend on the choices of c and i. So finally we let Xdc— 
be the lexicographically first among all the X'dd for all suitable choices of c' and i. 
For all h € we let Xdb— = Xdc 

This completes the induction step. 

We eventually reach a situation where we have defined Xdb— for all fc-precliques 
d and for all b. Similarly as in the definition of Y above, for each fc-preclique d we 
can now define a binary Za on such that {D, d) is isomorphic to (Zg, 1 . . . fc + 1). 
In other words, for all d we obtain an ordered copy of D whose first (fc + l)-elements 
represent d. We pick the lexicographically smallest among all the Za to be our canonical 
copy. 

This definition can be formalized in IFP+C and gives us the desired formula $\varphi$. □

Theorem 0 can be proved by a similar induction, though its proof is simpler be- 
cause we do not have to define a canonical copy of our database, but just determine its 
isomorphism type. Due to space limitations, we omit the proof. 

5 Monadic Second Order Logic 

The set of MSO-formulas is obtained by adding the following formula-formation rule to
the usual rules to form first-order formulas:

Given a formula $\varphi$ over $\tau \cup \{X\}$, where $X$ is a unary relation symbol, we may
form a new formula $\exists X \varphi$ over $\tau$.

Furthermore, we use the abbreviation $\forall X \varphi$ for $\neg \exists X \neg \varphi$.

The semantics of MSO is obtained by inductively defining a relation $\models$, the only
interesting step being

$D \models \exists X \varphi \iff$ there is a subset $X^*$ of the active domain of $D$ such that $(D, X^*) \models \varphi$.
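The semantic clause for $\exists X$ can be read as an exhaustive search over subsets. The sketch below is an illustration only, not part of the paper: it evaluates one fixed MSO sentence, 2-colourability of a graph, by trying every subset of the domain. The helper names and the two example graphs are assumptions.

```python
from itertools import chain, combinations


def subsets(domain):
    """Enumerate all subsets of a finite domain as tuples."""
    items = sorted(domain)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def exists_set(domain, phi):
    """D |= (exists X) phi iff some subset X* of the domain satisfies phi."""
    return any(phi(frozenset(x)) for x in subsets(domain))


def is_two_colourable(domain, edges):
    """Evaluate: exists X such that every edge has exactly one endpoint in X."""
    return exists_set(
        domain, lambda x: all((u in x) != (v in x) for u, v in edges)
    )


square = {(1, 2), (2, 3), (3, 4), (4, 1)}
triangle = {(1, 2), (2, 3), (3, 1)}
```

The search takes time exponential in the size of the domain, which is why the structural approach of this section matters on databases of bounded tree-width.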

In this section we want to prove the second part of Theorem 3: On a class of
databases of bounded tree-width, each MSO-definable query is IFP-definable.

To prove Courcelle’s H result that each MSO-definable query on graphs of bounded 
tree-width is computable in linear time one may proceed as follows: The first step is to 
compute a tree-decomposition of the input graph in which each vertex has valence at 
most 3. Then the crucial observation is that if we have such a tree-decomposition, we 
can describe each MSO-formula by a finite tree-automaton. It is not hard to simulate 
such an automaton in linear time. 

We proceed in a similar spirit, having to deal with two problems. The first is that we
do not know how to define a tree-decomposition of a database in IFP. But inductively
climbing along the precliques is almost as good. The second problem is that we cannot
guarantee a bounded valence, that is, we cannot give a bound on the number of components
that may occur in Lemma 3(1). However, in the proof sketched above this is
needed to make the automaton finite. To overcome this problem we first need to prove
the following “padding lemma”.

The quantifier-rank of an MSO-formula is the maximal number of nested (first and
second-order) quantifiers occurring in the formula. For a database $D$ and an $l$-tuple
$\bar a \in D$ we let $\mathrm{tp}_r(D, \bar a)$ be the set of all MSO-formulas $\Phi(\bar x)$ of quantifier-rank $r$ such
that $D \models \Phi(\bar a)$. The tuple $\bar a$ may be empty; in this case we just write $\mathrm{tp}_r(D)$. We let

$\mathrm{types}(r, l, \tau) = \{\mathrm{tp}_r(D, \bar a) \mid D \text{ database over } \tau,\ \bar a \in D\ l\text{-tuple}\}$.

Note that this set is finite for all $r, l, \tau$.

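Lemma 5. Let $\tau$ be a database schema and $l, r \ge 1$. Then there is an integer $K =
K(\tau, l, r)$ such that for all databases $D, E$ over $\tau$ and $l$-tuples $\bar a \in D$, $\bar b \in E$ the
following holds: If for all $\theta \in \mathrm{types}(r, l, \tau)$ we have

$\min(K, |\{C \in \mathrm{comp}(D \setminus \bar a) \mid \mathrm{tp}_r(\langle C^{+\bar a} \rangle^{D}, \bar a) = \theta\}|)$

$= \min(K, |\{C \in \mathrm{comp}(E \setminus \bar b) \mid \mathrm{tp}_r(\langle C^{+\bar b} \rangle^{E}, \bar b) = \theta\}|)$
then $\mathrm{tp}_r(D, \bar a) = \mathrm{tp}_r(E, \bar b)$.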

The proof of this lemma is an induction on r. It uses the Ehrenfeucht-Fraisse game 
characterizing MSO-equivalence. 

The essence of the lemma is that to compute $\mathrm{tp}_r(D, \bar a)$ we only need a finite amount
of information on the types of the components.

We can arrange this information in a large disjunction of formulas saying:

If there are $k_1$ components $C$ of type $\theta_1$ and $k_2$ components $C$
of type $\theta_2$ and . . . and $k_m$ components $C$ of type $\theta_m$,
then the type of the whole database is $\theta$.

Here $\theta_1, \dots, \theta_m$ range over the set $\mathrm{types}(r, l, \tau)$. The $k_i$ are integers between 0 and
$K(\tau, l, r)$, and in case $k_i = K(\tau, l, r)$ then the $i$th conjunct is to be read “if there are at
least $k_i$ components” (that is, we only count up to $K(\tau, l, r)$).
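Counting only up to a threshold can be made concrete as follows. This is a minimal sketch, assuming the types of the components have already been computed and are given as opaque labels; the labels `"t1"`, `"t2"` and the function names are hypothetical, and the threshold `K` is simply passed in rather than derived from the lemma.

```python
from collections import Counter


def truncated_profile(component_types, K):
    """Map each occurring type theta to min(K, number of components of type theta)."""
    counts = Counter(component_types)
    return {theta: min(K, n) for theta, n in counts.items() if min(K, n) > 0}


def same_profile(types_d, types_e, K):
    """Check the hypothesis of the padding lemma for two lists of component types."""
    return truncated_profile(types_d, K) == truncated_profile(types_e, K)


# Five versus three components of type t1 are indistinguishable once K = 3.
d_types = ["t1"] * 5 + ["t2"]
e_types = ["t1"] * 3 + ["t2"]
```

When `same_profile` holds for the threshold given by the lemma, the two databases have the same type, so a finite table indexed by truncated profiles is enough to determine the type of the whole database.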

Recall LemmaOl Using it we define, by a simultaneous induction, {k 2)-ary re- 
lations Xg, for 9 G types(r, A: -|- 1, r). The intended meaning of xy G Xg is “x is a 
$k$-preclique in $\langle C_y^{+\bar x} \rangle$ and $\mathrm{tp}_r(\langle C_y^{+\bar x} \rangle, \bar x) = \theta$”. It is not hard to formalize such an inductive
definition in IFP. Since for all $r, D, \bar a$ the type $\mathrm{tp}_r(D)$ is determined by $\mathrm{tp}_r(D, \bar a)$,
we have proved the following lemma.

Lemma 6. Let $\tau$ be a database schema, $r, k \ge 1$, and $\theta \in \mathrm{types}(r, 0, \tau)$. Then there
is an IFP-sentence $\varphi_\theta$ such that for all databases $D$ over $\tau$ of tree-width at most $k$ we
have

$\mathrm{tp}_r(D) = \theta \iff D \models \varphi_\theta$.

Theorem 3(2) follows, since each MSO-sentence $\Phi$ over $\tau$ of quantifier-rank $r$ is
equivalent to the (finite) disjunction

$\bigvee_{\theta \in \mathrm{types}(r, 0, \tau),\ \Phi \in \theta} \varphi_\theta$.

6 Conclusions

We have studied the concept of tree-width of a relational database and seen that from a 
descriptive complexity theoretic perspective classes of databases of bounded tree-width 
have very nice properties. 

Of course our results are of a rather theoretical nature. The more practically minded 
will ask whether databases occurring in practice can be expected to have small tree- 
width. Probably, this will not be the case for the average database. However, if in a 
specific situation it is known that the databases in question will have a small tree width, 
then it may be worthwhile to explore this. It seems plausible to us that in particular data 
carrying not too much structure can be arranged in a database of small tree-width. 

This point of view may be supported by a different perspective on tree-width, which 
roughly says that tree-width is a measure for the “global connectivity” of a graph (or 
database). Essentially, the tree-width of a graph is the same as its linkedness, which is
the minimal number of vertices required to split the graph and subsets of its vertex-set 
into more or less even parts (see El) 
Abstract. We study the problem of deciding satisfiability of first order
logic queries over views, our aim being to delimit the boundary between 
the decidable and the undecidable fragments of this language. Views 
currently occupy a central place in database research, due to their role 
in applications such as information integration and data warehousing. 
Our principal result is the identification of an important decidable class 
of queries over unary conjunctive views. This extends the decidability 
of the classical class of first order sentences over unary relations (the 
Löwenheim class). We then demonstrate how extending this class leads
to undecidability. In addition to new areas, our work also has relevance 
to extensions of results for related problems such as query containment, 
trigger termination, implication of dependencies and reasoning in de- 
scription logics. 



Key words: satisfiability, decidability, first order logic, database queries, database
views, conjunctive queries, unary views, inequality, the Löwenheim class.



1 Introduction 

The study of views in relational databases has recently attracted considerable 
attention, with the advent of applications such as data integration and data 
warehousing. In such domains, views can be used as “mediators” for source 
information that is not directly accessible to users. This is especially helpful in 
modelling the integration of data from diverse sources, such as legacy systems 
and/or the world wide web. 

Much of this recent research has been on fundamental problems such as con- 
tainment and rewriting/optimisation of queries using views [1 2j . In this paper, 
we examine the use of views in a somewhat different context, such as for mediat- 
ing and monitoring constraints on data sources where only views are available. 
An important requirement is that the resulting query/constraint language be de- 
cidable. We give results on the decision problem in this paper. As a by-product, 
our results also help us to analyse the specific problems mentioned above. 
1.1 Statement of the Problem

Consider a database with a schema over a set of base relations $R_1, \dots, R_p$ and
a set of views $V_1, \dots, V_n$ defined over them. A first order view query is a first
order query solely in terms of the given views. For example, given the views

$V_1(A, Y) \leftarrow R(X, Y), S(Y, Y, Z), T(Z, A, X), \neg Q(A, X)$
$V_2(Z) \leftarrow R(Z, Z)$

then $\exists X, Y((V_1(X,Y) \vee V_1(Y,X)) \wedge \neg V_2(X)) \wedge \forall Z(V_2(Z) \Rightarrow V_1(Z,Z))$ is an
example first order view query, but $\exists X, Y(V_1(X,Y) \vee R(Y,X))$ is not.

First order view queries are realistic for applications where the source data 
is unavailable, but summary data (in the form of views) is. Since most database 
languages are based on first order logic (or extensions thereof), we choose this 
as the language for manipulating the views. 

Our purpose in this paper is to determine, for what types of view definitions, 
satisfiability (over finite models) is decidable for the language. If views can be 
binary, then this language is as powerful as first order logic over binary base 
relations, and hence undecidable (even with only identity views). The situation 
becomes much more interesting, when we restrict the form views may take - in 
particular, when their arity must be unary. Such a restriction has the effect of 
constraining which parts of the database can be “seen” by a query and the way 
different parts of the database may interact. 

1.2 Contributions 

Our main contribution is the definition of a language called the first order unary 
conjunctive view language and a proof of its decidability. As the name suggests, it 
uses unary arity views defined by conjunctive queries. It is a maximal decidable 
class in the sense that increasing the expressiveness of the view definition results 
in undecidability. Some interesting aspects of this result are: 

— It is well known that first order logic over solely monadic relations is de- 
cidable, but the extension to dyadic relations is undecidable P]. The unary 
conjunctive view language can be seen as an interesting intermediate case 
between the two, since although only monadic predicates (views) appear in 
the query, they are intimately tied to database relations of higher arity. 

— The language is able to express interesting properties such as generalisations 
of unary inclusion dependencies and containment of monadic datalog 
queries under constraints. This is discussed in Section 4. 

1.3 Paper Outline 

The paper is structured as follows: Section 2 defines the unary conjunctive view
language and shows its decidability. Section 3 shows that extensions to the language,
such as negation, inequality or recursion in views, result in undecidability.
Section 4 looks at applications of the results. Section 5 gives some comparisons
to other work and Section 6 summarises and then looks at future work.

2 Unary Conjunctive Views

The first language we consider is first order queries (without inequality) over 
unary conjunctive views. Each view has arity one and it is defined by a union of 
some arbitrary conjunctive queries. Some example views are 

$V_1(X) \leftarrow R(X, Y), R(Y, Z), R(Z, X)$

$V_2(X) \leftarrow S(X, X, Y), T(Z, X), R(Y, Y)$

$V_2(X) \leftarrow S(X, X, X)$

$V_3(X) \leftarrow T(X, Y), R(Y, X), S(X, Y, Z)$

and an example first order query over them is 

3X(U(V)A-V3(V)AVy(U(^) ^ (-U(F)VU(n))) 

Note that this sentence can of course be rewritten to eliminate the views, since 
each view is just a positive existential query (e.g. $V_1(X)$ can be rewritten as
$\exists Y, Z(R(X, Y) \wedge R(Y, Z) \wedge R(Z, X))$). Observe that there is no restriction on the
number of variables that can be used by a view. Since this language can express
the property that a graph has a clique of size $k + 1$, it is not subsumed by any
finite variable logic Lk (for finite k) [Jj . Of course, having unary arity does have 
a price. In particular, some restrictions inherent in this language are 

— Limited negation: Since negation can only be applied to the unary views,
properties such as transitivity ($\forall X, Y, Z(edge(X, Y) \wedge edge(Y, Z) \Rightarrow edge(X, Z))$)
cannot be expressed.

— No inequality: It isn’t possible to test inequality of variables and in particular
it is impossible to write a sentence specifying that a single attribute $X$ of a
relation functionally determines the others (namely $\forall X, Y, Z(edge(X, Y) \wedge
edge(X, Z) \Rightarrow Y = Z)$).

Our results in this paper verify that this language indeed has these two limitations,
since the language itself is decidable, whereas adding $\neg$ or $\neq$ results in
undecidability. Given these limitations, a proof of decidability might seem to be
straightforward. However, this turns out not to be the case. Our plan is to show 
that the first order view queries have a bounded model property, for arbitrary 
fixed view definitions. To accomplish this, we need to construct a small model 
using a bounded number of constants from a large model. Difficulties arise, however,
from the fact that sentences such as $\neg \exists X(V_1(X) \wedge V_2(X) \wedge \neg V_3(X) \wedge \neg V_4(X))$
can be expressed. Such sentences act like constraints on the space of possible 
models. When constructing the small model, we need to map different constants 
in the large model to identical ones, and we must do so without violating such 
constraints. We achieve this by using a procedure similar to chasing, plus an 
interesting duplicating and wrap-around technique. The rest of this section is 
devoted to describing this procedure and technique. 

Theorem 1. Satisfiability for the first order unary conjunctive view query language
is decidable. In fact it is in SPACE($2^{(m \times p)^m}$), but NTIME($2^{c\,p/\log p}$)
hard, for queries using views of maximum length $m$ over $p$ relations.
Proof. Let $\psi$ be an arbitrary first order view query in the language. The first
step consists of transforming $\psi$ to $\psi'$ in order to eliminate the union of views,
that is, $\psi'$ is equivalent to $\psi$ and $\psi'$ only uses views without union. $\psi'$ is easily
constructed by replacing each $V(X)$ in $\psi$ using the disjunct $V_1(X) \vee V_2(X) \vee \dots \vee
V_w(X)$, where (i) $V$'s definition uses union, (ii) $V_1, \dots, V_w$ is an enumeration of
the conjunctive views in the definition of $V$, and (iii) no union is used in defining
each $V_i$. Let there be $n$ views in $\psi'$ over the $p$ relations. Let $m$ be the maximum
number of variables (including duplicates) used in defining any view body. For
example, $m = 7$ for the views listed at the beginning of Section 2. In addition
to these views, we define all possible other views using at most $m$ variables
(including duplicates). Let $N$ denote the total number of non-isomorphic views
obtained. Then it can be easily verified that $N \le (m \times p)^m \times m$. We then
“and” the formula $(V_x \vee \neg V_x)$ with $\psi'$, for each view $V_x$ that didn’t originally
appear in $\psi'$. The purpose of defining these “redundant” views is to ensure
maximum coverage of equivalence classes (whose definition will be given shortly).

It is possible to interpret $\psi'$ in two possible ways. The standard way is by
treating it as a first order formula over the base-relation schema $\{R_1, \dots, R_p\}$ (and
indeed this corresponds to our intended meaning). It is also possible to interpret
it in a non-standard way, as a first order formula over the view schema
$\{V_1, \dots, V_N\}$, where each $V_i$ is treated as a materialised view. Let us define a
function $f : I_{base} \to I_{view}$, which relates base-relation models to view models.

$f$ is clearly a total function, since any base-relation model has a view-model
counterpart. $f$ is not an onto function though, since some view-model
interpretations could not be realised by the underlying base relations.
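That $f$ is total but not onto can be checked by hand on a tiny schema. The sketch below is an illustration under assumptions: it uses the two views $V_1(X) \leftarrow R(X,Y)$ and $V_2(X) \leftarrow R(Y,X)$ from the Figure 1 example, represents $R$ as a set of pairs, and searches for a preimage only among base instances over a small fixed domain. The function names are hypothetical.

```python
from itertools import product


def materialise(r):
    """f: map a base instance (a set of pairs for R) to its view instance."""
    v1 = {x for (x, _) in r}  # V1(X) <- R(X, Y)
    v2 = {y for (_, y) in r}  # V2(X) <- R(Y, X)
    return frozenset(v1), frozenset(v2)


def realisable(view_instance, domain):
    """Search all instances of R over the domain for a preimage under f."""
    pairs = list(product(domain, repeat=2))
    for bits in product([False, True], repeat=len(pairs)):
        r = {p for p, keep in zip(pairs, bits) if keep}
        if materialise(r) == view_instance:
            return True
    return False


good = (frozenset({1}), frozenset({1}))  # realised by R = {(1, 1)}
bad = (frozenset({1}), frozenset())      # V1 non-empty forces V2 non-empty
```

The instance `bad` has no preimage over any domain, since every tuple that puts a constant into $V_1$ also puts one into $V_2$; the bounded search merely confirms this on two constants.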

We now define equivalence classes and equivalence class states for the query
$\psi'$. An equivalence class is simply a formula over the $N$ views

$C(X) = \{\langle X \rangle \mid (\neg)V_1(X) \wedge (\neg)V_2(X) \wedge \dots \wedge (\neg)V_N(X)\}$,

where each $V_i(X)$ can appear either positively or negatively. Examples of such
a class include $\{\langle X \rangle \mid V_1(X) \wedge V_2(X) \wedge \dots \wedge V_N(X)\}$ and
$\{\langle X \rangle \mid \neg V_1(X) \wedge V_2(X) \wedge \dots \wedge V_N(X)\}$. These equivalence classes can be enumerated
as $C_1, C_2, \dots, C_{2^N}$.

Using standard techniques, which we explain below, any instance which satisfies
$\psi'$ can be completely described by the behaviour of these equivalence classes.
An equivalence class $C_i$ appears positively in a model for $\psi'$ if the formula
$\exists X C_i(X)$ is true and it appears negatively if the formula $\neg \exists X C_i(X)$ is true. An
equivalence class state is a partition of all the equivalence classes into two groups
- those appearing positively and those appearing negatively. We number the positive
classes in such a state by $C_1, \dots, C_{n_1}$ and the negative classes $C'_1, \dots, C'_{n_2}$.
Clearly $n_1 + n_2 = 2^N$. Thus an equivalence class state is described by the formula

$\exists X C_1(X) \wedge \exists X C_2(X) \wedge \dots \wedge \exists X C_{n_1}(X)$

$\wedge\ \neg \exists X C'_1(X) \wedge \neg \exists X C'_2(X) \wedge \dots \wedge \neg \exists X C'_{n_2}(X)$.
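Equivalence classes and equivalence class states are easy to compute for a concrete view instance. The following sketch is an illustration only: a class is represented as a tuple of truth values (one per view), and a state as the set of classes that contain at least one constant. The views are again $V_1(X) \leftarrow R(X,Y)$ and $V_2(X) \leftarrow R(Y,X)$, and all function names are assumptions.

```python
def equivalence_class(constant, views):
    """Sign vector of a constant: True where the constant belongs to the view."""
    return tuple(constant in v for v in views)


def class_state(domain, views):
    """The positive classes of an instance; every other sign vector is negative."""
    return {equivalence_class(c, views) for c in domain}


r = {(1, 2), (2, 3)}
v1 = {x for (x, _) in r}  # V1(X) <- R(X, Y)
v2 = {y for (_, y) in r}  # V2(X) <- R(Y, X)
domain = {c for pair in r for c in pair}
state = class_state(domain, [v1, v2])
```

With $N = 2$ views there are $2^N = 4$ classes; here three are positive and only the class $\neg V_1 \wedge \neg V_2$ is negative, because every constant of the active domain occurs in some tuple of $R$.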

Let $\phi_1, \dots, \phi_{2^{2^N}}$ be an enumeration of the equivalence class states. Since the
equivalence class states are mutually exclusive, it follows that $\psi'$ is satisfiable iff
one of $(\psi' \wedge \phi_i)$ is satisfiable. Hence to test satisfiability of $\psi'$ it suffices to be
able to test satisfiability of a single $\psi' \wedge \phi_i$ expression.

For the situation where we consider views in $\psi' \wedge \phi_i$ as arbitrary relations, we
observe that $\phi_i \Rightarrow \psi'$ is valid if $\phi_i \wedge \psi'$ is satisfiable (see lemma 2 in appendix).

Therefore, given that $\phi_i$ is satisfiable by some realisable view instance $I$, if
$\phi_i \wedge \psi'$ is satisfiable by some arbitrary view instance $I'$ (not necessarily realisable),
then $\phi_i \wedge \psi'$ is satisfiable by $I$. The converse is obviously true. Thus the
satisfiability of $\phi_i \wedge \psi'$ can be tested by considering its truth over all possible small
models of $\phi_i$. Lemma 1 shows that we can test satisfiability of a $\phi_i$ by considering
all small instances using at most $k = 2^{2 \times (m \times p)^m \times m} \times ((m \times p)^m \times m \times m)^{2m+3}$
constants, which is $O(2^{(m \times p)^m})$.

Overall, to test satisfiability of $\psi'$, we need to search for satisfiable $\phi_i$’s by
testing all possible models of bounded size. This size bound then provides us
with the upper bound of SPACE($2^{(m \times p)^m}$) for the decision procedure.

Hardness: In 0,it is shown that satisfiability is hard for 

first order queries over unary relations, a special case of our language. Here $c$ is
some constant and $l$ is the length of the formula. Since the number of relations
is $\le l$, our problem is NTIME($2^{c\,p/\log p}$) hard. ■

We next sketch a proof for the lemma we have just needed. The key insight
is that for a query $\phi$, there is a bound $m$ on the number of variables a view may
have. This allows a complex chasing over “justifications” of constants that only
“wraps around” at “distances” $> m$. (Given a set of tuples, we view a tuple as
one generalised edge, and define distance as the length of shortest paths.)

Lemma 1. Let $\phi$ be an equivalence class state formula over $N$ views of maximum
definition length $m$ using $p$ relations. If $\phi$ is satisfiable, then it has a model
using at most $k = 2^{2 \times (m \times p)^m \times m} \times ((m \times p)^m \times m \times m)^{2m+3}$ constants.

Proof. (sketch) The proof is quite complex and so we begin with an overview.
We first show how any model for $\phi$ can be described by a structure $H$ called a
justification hierarchy, which links tuples in the model together according to their
distribution in equivalence classes. We then show how to build a new justification
hierarchy $H'$, which is a maximal (enlarged) version of $H$, has some “nicer”
properties and is also a model for $\phi$. After this, we build another hierarchy $H''$
which consists of a number of distinct copies of $H'$, each of which is pruned to
a certain depth. The constants at the lowest level are justified by the constants
at the roots of the copies. $H''$ can be associated with a new model $I''$ such that
it uses no more than $k$ constants, and $I''$ is also a model of $\phi$. Some of the steps
are further explained in the appendix.

Observe that, if the only positive class in $\phi$ is $\exists X(\neg V_1(X) \wedge \dots \wedge \neg V_N(X))$,
then $\phi$ is unsatisfiable since it says (i) that there are no constants in the database
and (ii) there must exist a constant. So we can assume that $\phi$ contains some other
positive class.

Justification Hierarchies: Recall that the positive equivalence classes are numbered
$C_1, \dots, C_{n_1}$ and observe that any constant in a model for $\phi$ appears in
exactly one equivalence class. Given an arbitrary constant $a_i$ in class $C_i$, there
must exist at least one associated justification set for the truth of $C_i(a_i)$. A justification
set is simply a set of tuples, in the given model, which make $C_i(a_i)$ true.
A minimal justification set is a justification set containing no redundant tuples
(such a set is not in general unique). We will henceforth regard all justifications
as being minimal.

A justification hierarchy for an equivalence class $C_i$ is a hierarchy of justification
sets. At the top is a justification set for some constant $a$ in equivalence
class $C_i$. The second level consists of justification sets for the equivalence class of
each constant appearing in the justification set for $C_i(a)$. The third level consists
of justification sets for the equivalence class of each new constant appearing at
level two etc. The hierarchy extends until every constant in the hierarchy has
been justified. Figure 1 shows a justification hierarchy for an equivalence class
state $\phi$ in the instance $I = \{(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)\}$. $\phi$ is defined by

$\exists X(V_1(X) \wedge V_2(X)) \wedge \neg \exists X(V_1(X) \wedge \neg V_2(X))$

$\wedge\ \neg \exists X(\neg V_1(X) \wedge V_2(X)) \wedge \neg \exists X(\neg V_1(X) \wedge \neg V_2(X))$

where $V_1(X) \leftarrow R(X, Y)$ and $V_2(X) \leftarrow R(Y, X)$. ($\phi$ contains one positive class
and three negative ones.)

In general, to describe a model for $\phi$, we will need a set of justification
hierarchies, one for each positive equivalence class.

S1 = {(1,2),(5,1)}                            Level 0

S2 = {(2,3),(1,2)}    S5 = {(5,1),(4,5)}    Level 1

S3 = {(3,4),(2,3)}    S4 = {(3,4),(4,5)}    Level 2

Figure 1: Example justification hierarchy for query $\phi$
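The hierarchy of Figure 1 can be recomputed level by level. The sketch below is an illustration under assumptions, not the construction used in the proof: it is specialised to the single positive class $V_1 \wedge V_2$ of the example, picks one outgoing and one incoming tuple as the minimal justification set of a constant, and expands only constants not seen on earlier levels. The function names are hypothetical.

```python
def justification(a, r):
    """A minimal justification set for a in class V1 and V2.

    One tuple (a, y) witnesses V1(a) and one tuple (y, a) witnesses V2(a).
    Raises ValueError if a is not in that class.
    """
    out_edge = min(t for t in r if t[0] == a)
    in_edge = min(t for t in r if t[1] == a)
    return {out_edge, in_edge}


def hierarchy_levels(root, r):
    """Justify the root, then every new constant, until nothing new appears."""
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        level = {a: justification(a, r) for a in frontier}
        levels.append(level)
        used = {c for s in level.values() for t in s for c in t}
        frontier = sorted(used - seen)
        seen.update(frontier)
    return levels


cycle = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)}
levels = hierarchy_levels(1, cycle)
```

Starting from constant 1, the three computed levels contain the justification sets for the constants 1, then 2 and 5, then 3 and 4, matching S1 through S5 in the figure.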





Maximal Justification Hierarchies: Suppose we have a model $I$, and we have
constructed exactly one justification hierarchy for each positive class. Let $H$
consist of these justification hierarchies. By copying justification sets if necessary,
we can assume that the justification hierarchies have a balanced structure. Here
“balanced” refers to the notion that a hierarchy can be viewed as a tree, whose
edges link justification sets.

The next step consists of renaming the constants in the hierarchy to remove 
‘unnecessary’ equalities among the constants in justifications. We build a new 
justification hierarchy set of the same height. We do this by iterating over all 
justifications: for each constant a, we replace all constants used in the justifica- 
tion of a using new constants. Call this new hierarchy the maximised justification 
hierarchy set (abbreviated MJHS). 
Copying and Pruning Justification Hierarchies: Let $I_{max}$ be a maximised
hierarchy set for a model of $\phi$. Let $n_1$ be the number of constants at height $m$.
Then

$n_1 \le$ (no. of hierarchies in MJHS) $\times$ ((no. of views) $\times$ (max. length of a view))$^m$
$\le 2^N \times (N \times m)^m$.

In this step, we make $n_1$ copies, denoted by $I_1, I_2, \dots, I_{n_1}$, of $I_{max}$, using a
different set of constants for each copy. Let $I_{union}$ be the union $I_1 \cup I_2 \cup \dots \cup I_{n_1}$.
Then it can be verified that $\phi$ will still be true for this new instance. Next,
all levels below $m$ in the hierarchy are removed and constants at level $m$ are
“wrapped-around” by re-justifying them with the root of one of the copies for
the same class. Again it can be verified that $\phi$ will still be true for this new
instance.

The number of constants in $I_{union}$ is bounded by num_copies $\times$ (max num.
of constants in a justification hierarchy of height $m + 3$). Thus this number is

$\le n_1 \times (2^N \times (N \times m)^{m+3})$

$\le (2^N \times (N \times m)^m) \times (2^N \times (N \times m)^{m+3})$

$\le 2^{2N} \times (N \times m)^{2m+3}$

$\le 2^{2 \times (m \times p)^m \times m} \times ((m \times p)^m \times m \times m)^{2m+3}$, which is $O(2^{(m \times p)^m})$.
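To get a feeling for the size of these bounds, they can be evaluated with exact integer arithmetic. This is only a sketch that plugs numbers into the inequalities above; the function names are hypothetical, and the choice $m = 7$, $p = 3$ corresponds to the example views of Section 2, which use the three relations $R$, $S$ and $T$.

```python
def view_count_bound(m, p):
    """Upper bound N <= (m * p)^m * m on the number of non-isomorphic views."""
    return (m * p) ** m * m


def constant_bound(m, p):
    """Upper bound k = 2^(2N) * (N * m)^(2m + 3), with N replaced by its bound."""
    n = view_count_bound(m, p)
    return 2 ** (2 * n) * (n * m) ** (2 * m + 3)


print(view_count_bound(7, 3))  # 12607619787
print(constant_bound(2, 1))    # 17592186044416, that is 2 ** 44
```

Already for $m = 7$ and $p = 3$ the exponent $2N$ exceeds $2 \times 10^{10}$, so `constant_bound(7, 3)` should not be evaluated; the bound establishes decidability rather than a practical search space.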



Theorem 1 also holds for infinite models, since even if the initial justification
hierarchies are infinite, the method used in lemma 1 can still be used to exhibit
a finite model. We thus obtain finite controllability (every satisfiable formula is
finitely satisfiable).

Proposition 1. The first order unary conjunctive view language is finitely con- 
trollable. ■ 

3 Extending the View Definitions 

The previous section showed that the first order language using unary conjunc- 
tive view definitions is decidable. A natural way to increase the power of the 
language is to make view bodies more expressive (but retain unary arity for 
the views). Unfortunately, as we will show, this results in satisfiability becoming 
undecidable. 

The first extension we consider is allowing inequality in the views, e.g.,
$V(X) \leftarrow R(X, Y), S(X, X), X \neq Y$

Call the first order language over such views the first order unary conjunctive$^{\neq}$
view language. This language allows us to check whether a two counter machine
computation is valid and terminates. This leads to the following result:

Theorem 2. Satisfiability is undecidable for the first order unary conjunctive$^{\neq}$
view query language.






James Bailey and Guozhu Dong 



Proof. The proof is by a reduction from the halting problem of two counter 
machines (2CM’s) starting with zero in the counters. Given any description of 
a 2CM and its computation, we can show how to a) encode this description in 
database relations and b) define queries to check this description. We construct 
a query which is satisfiable iff the 2CM halts. The basic idea of the simulation is 
similar to one in |^, but with the major difference that cycles are allowed in the 
successor relation, though there must be at least one good chain. See appendix 
for further details. ■ 

The second extension we consider is to allow “safe” negation in the conjunctive
views, e.g.

$V(X) \leftarrow R(X, Y), \neg S(X, X)$

Call the first order language over such views the first order unary conjunctive$^{\neg}$
view language. It is also undecidable, by a result in [5].

Theorem 3. [5] Satisfiability is undecidable for the first order unary conjunctive$^{\neg}$
view query language. ■

A third possibility for increasing the expressiveness of views would be to keep 
the body as a pure conjunctive query, but allow views to have binary arity, e.g. 

V{X,Y) ^ R{X,Y) 

This doesn’t yield a decidable language either, since this language has the same 
expressiveness as first order logic over binary relations, which is known to be 
undecidable. 

Proposition 2. Satisfiability is undecidable for the first order binary conjunc- 
tive view language. ■ 

A fourth possibility is to use unary conjunctive views, but allow recursion in
the view bodies, e.g.

$V(X) \leftarrow edge(X, Y)$

$V(X) \leftarrow V(X) \wedge edge(Y, X)$

Call this the first order unary conjunctive$^{rec}$ language. This causes undecidability
also.

Theorem 4. Satisfiability is undecidable for the first order unary conjunctive$^{rec}$
view language.

Proof. (sketch) The proof of theorem 2 can be adapted by removing inequality
and instead using recursion to ensure there is a connected chain in succ. It
then becomes more complicated, but the main property needed is that zero is
connected to last via the constants in succ. This can be done by

$\mathit{conn\text{-}zero}(X) \leftarrow \mathit{zero}(X)$
$\mathit{conn\text{-}zero}(X) \leftarrow \mathit{conn\text{-}zero}(Y), \mathit{succ}(Y, X)$

$\exists X(\mathit{last}(X) \wedge \mathit{conn\text{-}zero}(X))$
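The recursive view conn-zero is a reachability query, and its least fixpoint can be computed by naive iteration. The sketch below is an illustration only; it represents zero and last as sets of constants and succ as a set of pairs, and the function names are assumptions.

```python
def conn_zero(zero, succ):
    """Least fixpoint of the two conn-zero rules.

    conn-zero(X) <- zero(X)
    conn-zero(X) <- conn-zero(Y), succ(Y, X)
    """
    reached = set(zero)
    changed = True
    while changed:
        changed = False
        for (y, x) in succ:
            if y in reached and x not in reached:
                reached.add(x)
                changed = True
    return reached


def chain_reaches_last(zero, succ, last):
    """Evaluate the sentence: exists X (last(X) and conn-zero(X))."""
    return bool(conn_zero(zero, succ) & set(last))
```

For instance, with `zero = {0}` and `succ = {(0, 1), (1, 2), (5, 6)}` the fixpoint is `{0, 1, 2}`; the pair `(5, 6)` belongs to a component of succ that is not connected to zero and is ignored, which is the role recursion plays in place of inequality.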
4 Applications

4.1 Containment 

We now briefly examine the application of our results to conjunctive query containment.
Theorem 1 implies we can test whether $Q_1(X) \subseteq Q_2(X)$ under the
constraints $C_1 \wedge C_2 \dots \wedge C_n$ where $Q_1, Q_2, C_1, \dots, C_n$ are all first order unary
conjunctive view queries. This just amounts to testing whether the sentence
$\exists X(Q_1(X) \wedge \neg Q_2(X)) \wedge C_1 \wedge \dots \wedge C_n$ is unsatisfiable.

We can similarly test containment and equivalence of monadic datalog
programs under first order unary conjunctive view constraints. Monadic datalog is
datalog with intensional relations having unary arity. To illustrate the idea, the
conn-zero view defined above could be written

∀X(conn-zero(X) ⇔ (zero(X) ∨ tmp(X)))

where tmp is the view

tmp(X) ← conn-zero(Y) ∧ succ(Y,X)

Containment of monadic datalog programs was studied in |^, where it was 
shown to be EXPTIME-hard and solvable in 2-EXPTIME using automata-theoretic
techniques. These complexity results are in line with Theorem 1, since the
decision problem we are addressing is more general.

Of course, we can also show that testing the containment Q1 ⊆ Q2 is
undecidable if Q1 and Q2 are first order unary conjunctive^≠ view queries, first
order unary conjunctive^¬ view queries or first order unary conjunctive^rec view
queries, since a query Q is unsatisfiable exactly when the containment Q ⊆ ∅ holds.

Containment of queries with negation was first considered in PH . There it was 
essentially shown that the problem is decidable for queries which do not apply 
projection to subexpressions with difference. Such a language is disjoint from 
ours, since it cannot express a sentence such as ∃Y V4(Y) ∧ ¬∃X(V1(X) ∧ ¬V2(X)),
where V1 and V2 are views which may contain projections.

4.2 Dependencies 

Unary inclusion dependencies were identified as useful in pj . They take the form 
R[X] ⊆ S[Y]. If we allow R and S above to be unary conjunctive views, we get
unary conjunctive view containment dependencies. Observe that the unary views 
are actually unary projections of the join of one or more relations. 

We can also define a special type of dependency called a proper unary
conjunctive inclusion dependency, having the form Q1(X) ⊂ Q2(X). The need for
such a dependency could arise in security applications where it is necessary
to restrict redundancy. If {d1, ..., dk} is a set of such dependencies, then it is
straightforward to test whether they imply another dependency d_x by testing
the satisfiability of an appropriate first order unary conjunctive view query.

Theorem 5. Implication is decidable for unary conjunctive view containment 
dependencies with subset and proper subset operators. ■ 
Similarly, we can consider unary conjunctive^≠ containment dependencies.
The tests in the proof of Theorem 2 for the 2CM can be written in the form
Q1(X) ⊆ Q2(X), with the exception of the non-emptiness constraints, which
must use the proper subset operator.

Theorem 6. Implication is undecidable for unary conjunctive^≠ (or conjunctive^¬)
view containment dependencies with the subset and the proper subset
operators. ■

4.3 Termination 

The languages in this paper have their origins in P|, where active database 
rule languages based on views were studied. The decidability result for first 
order unary conjunctive views can be used to positively answer an open question 
raised in [2], which essentially asked whether termination is decidable for active
database rules using unary conjunctive views. 

5 Related Work 

Satisfiability of first order logic has been thoroughly investigated in the context 
of the classical decision problem Pj . The main thrust there has been determining 
for which quantifier prefixes first order languages are decidable. We are not aware 
of any result of this type which could be used to demonstrate decidability of the 
unary conjunctive view language. Instead, our result is best classified as a new 
decidable class generalising the traditional decidable unary first-order language 
(the Löwenheim class). Use of the Löwenheim class itself for reasoning about
schemas is described in d. where applications towards checking intersection 
and disjointness of object oriented classes are given. 

As observed earlier, a main direction of work concerning views has been in 
areas such as containment. An area of emerging importance is description logics
and coupling them with Horn rules m- In 0, the query containment problem
is studied in the context of the description logic DLR_reg. There are certain
similarities between this and the first order (unary) view languages we have
studied in this paper. The key difference appears to be that although DLR_reg can
be used to define view constraints, these constraints cannot express unary
conjunctive views (since assertions do not allow arbitrary projection). Furthermore,
DLR_reg can express functional dependencies on a single attribute, a feature
which would make our language undecidable (see the proof of Theorem 2). There is
a result in P], however, showing undecidability for a fragment of DLR_reg with
inequality, which could be adapted to give an alternative proof of Theorem 2
(though there inequality is used in a slightly more powerful way).

A recent work that deals with complexity of views is There, results are 
given for the view consistency problem. This involves determining whether there 
exists an underlying database instance that realises a specific (bounded) view 
instance. The problem we have looked at in this paper is slightly more
complicated: testing satisfiability of a first order view query asks whether
there exists an (unbounded) view instance that makes the query true. This
explains how satisfiability can be undecidable for first order unary conjunctive^≠
queries, while view consistency for non-recursive datalog^≠ views is in NP.

6 Summary and Further Work 

Table 1 provides a summary of our decidability results. We can see that they are 
tight in the sense that the unary conjunctive view language cannot be obviously 
extended without undecidability resulting. In one sense the picture is negative:
we cannot expect to use view bodies with negation, inequality or recursion, or
allow views to have binary arity, if we are aiming for decidability. On the other
hand, we feel that the decidable case we have identified is sufficiently natural
and interesting to be of practical use.

Table 1: Summary of Decidability Results for First Order View Languages

Unary Conjunctive View         Decidable
Unary Conjunctive^≠ View       Undecidable
Unary Conjunctive^rec View     Undecidable
Unary Conjunctive^¬ View       Undecidable [2]
Binary Conjunctive View        Undecidable



As part of future work we intend to investigate relationships to description
logics and also look at further ways of introducing negation into the language.
One possibility is to allow views of arity zero to specify description logic like
constraints, such as R1(X,Y) ⊆ R2(X,Y). We would also like to narrow the
complexity gap which appears in Theorem 1 and gain a deeper understanding of
the expressiveness of the view languages we have studied.
Appendix

Lemma 2. Suppose we treat the unary view symbols as arbitrary base relations.
If ψ' ∧ φ_i is satisfiable, then φ_i ⇒ ψ' is valid.

Proof. (Sketch) Suppose there exists an instance I for which ψ' ∧ φ_i is true.
By deleting “extra” constants, we can transform I into a minimal model I_min
such that each positive class contains exactly one constant. Then ψ' ∧ φ_i is also
true in I_min. (One can verify this by translating ψ' ∧ φ_i into relational algebra.
Then constants in the same equivalence class always travel together when we
evaluate the subexpressions.) Let I' be any instance for which φ_i is true. Then
I_min is isomorphic to a submodel of I', and we can view I' as the result of
adding constants to the positive classes of I_min. Similar to deleting constants,
adding constants doesn’t alter the truth value of ψ'. Hence the formula φ_i ∧ ¬ψ'
is unsatisfiable, or equivalently, the formula φ_i ⇒ ψ' is valid. ■

Some Further Descriptions of Procedures Used in Lemma ^ 

Building a Justification Hierarchy Each level of the hierarchy contains
justification sets for constants in the model for ψ (call this model I). Lower
levels justify constants which appear in higher levels.

The base case: Choose a constant (say a_1) in C_1(I). Choose a justification set
for a_1 in I (say S_{a_1}). Insert S_{a_1} at level zero.

The recursive case: For each tuple t in the hierarchy at level i containing a
constant a_k such that S_{a_k} doesn’t appear at any level < i, create a
justification set S_{a_k} at level i+1.



Maximisation Procedure It is possible to define conjunctive query paths
through a justification hierarchy between the root node and the leaf nodes. The
distance between two constants in the hierarchy is a measure of the number of
joins required to link these constants together. To ensure correctness, we would
like the distance between a constant at the root node and a constant at level m
to be m. Since constants within a justification hierarchy may appear in several
justification sets, this property may not be true. The following procedure replaces
duplicate constants in the hierarchy with new ones, without compromising the
truth of ψ. The distance requirement is then satisfied.

Suppose that each of the hierarchies has height of at least m + 3.

/* Rename */

For i = 1 to (m + 2) do
    Let x be a constant in a justification set S_q at level i such that x ≠ q.
    If x appears at a level < i then
        choose an entirely new constant k
        For each S_t at level i containing x and t ≠ x do
            S_t = S_t[x\k];
            S_x at level i + 1 = S_x[x\k];
        end For
    end If
end For

Rename is a function which isomorphically substitutes new constants for all
those currently in the hierarchy. Note that if x gets renamed to k, then S_x gets
renamed to S_k.



Copying and Pruning Algorithm

Make a number of distinct copies of the hierarchies, as explained in the proof.
Call the result of the above step H'.

Suppose the constants at height m + 1 in H' are a_0, a_1, ..., a_k.
For j = 1 to k
    suppose (i) C_s is the equivalence class of a_j and (ii) b is the constant
    being justified at the root of the hierarchy for C_s in I_{(j+1) mod k}
    delete S_{a_j} at level m + 1
    S_{a_j} at level m + 1 = S_b[b\a_j]
End For

remove all levels of the hierarchy at height > m + 1
One can now envisage the model for ψ as a collection of interlinked justification
hierarchies. Constants at level m of each hierarchy are justified by the root
node of another hierarchy.

The correctness proof for this construction is quite lengthy, and so will be deferred
to the full version of the paper.

Proof Details of Theorem 2 A two-counter machine is a deterministic finite
state machine with two non-negative counters. The machine can test whether a
particular counter is empty or non-empty. The transition function has the form

δ: S × {=, >} × {=, >} → S × {pop, push} × {pop, push}

For example, the statement δ(4, =, >) = (2, push, pop) means that if we are in
state 4 with counter 1 equal to 0 and counter 2 greater than 0, then go to state
2 and add one to counter 1 and subtract one from counter 2.

The computation of the machine is stored in the relation config(T, S, C1, C2),
where T is the time, S is the state and C1 and C2 are values of the counters.
The states of the machine can be described by integers 0, 1, ..., h where 0 is the
initial state and h the halting (accepting) state. The first configuration of the
machine is config(0, 0, 0, 0) and thereafter, for each move, the time is increased
by one and the state and counter values changed in correspondence with the
transition function.
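
To make the role of the config relation concrete, the following Python sketch simulates a two-counter machine and records one tuple (T, S, C1, C2) per step. It is only an illustration under our own assumptions: the dictionary encoding of the transition function, the name run_2cm and the step bound max_steps are hypothetical and do not appear in the construction.

```python
def run_2cm(delta, halt_state, max_steps=1000):
    """Simulate a two-counter machine started in state 0 with empty counters.

    delta maps (state, test1, test2) to (new_state, op1, op2), where each test
    is "=" (counter is zero) or ">" (counter is positive) and each op is
    "push" (add one) or "pop" (subtract one).
    Returns the list of config tuples (T, S, C1, C2).
    """
    state, c1, c2 = 0, 0, 0
    config = [(0, 0, 0, 0)]                      # first configuration
    for t in range(1, max_steps + 1):
        if state == halt_state:                  # no move out of the halting state
            break
        tests = ("=" if c1 == 0 else ">", "=" if c2 == 0 else ">")
        state, op1, op2 = delta[(state,) + tests]
        c1 += 1 if op1 == "push" else -1
        c2 += 1 if op2 == "push" else -1
        if c1 < 0 or c2 < 0:
            raise ValueError("pop applied to an empty counter")
        config.append((t, state, c1, c2))
    return config
```

For instance, with delta = {(0, "=", "="): (1, "push", "push"), (1, ">", ">"): (2, "pop", "push")} and halting state 2, the machine halts after two moves.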

We will use some relations to encode the computation of 2CMs starting with 
zero in the counters. These are: 

— S_0, ..., S_h: each contains a constant which represents that particular state.

— succ: the successor relation. We will make sure it contains one chain starting 
from zero and ending at last (but it may in addition contain unrelated 
cycles). 

— config: contains computation of the 2CM. 

— zero: contains the first constant in the chain in succ. This constant is also 
used as the number zero. 

— last: contains the last constant in the chain in succ. 

Note that we sometimes blur the distinction between unary relations and 
unary views, since a view V can simulate a unary relation U if it is defined by 
V(X) ← U(X).

The unary and nullary views (the latter can be eliminated using quantified 
unary views) are: 

— halt: true if the machine halts. 

— bad: true if the database doesn’t correctly describe the computation of the 
2CM. 

— dsucc: contains all constants in succ. 

— dT: contains all time stamps in config. 

— dP: contains all constants in succ with predecessors. 

— dCol1, dCol2: are projections of the first and second columns of succ.
When defining the views, we also state some formulas (such as hasPred) over
the views which will be used to form our first order sentence over the views.

— The “domain” views (those starting with the letter d) are easy to define, e.g.

dP(X) ← succ(Z,X)
dCol1(X) ← succ(X,Y)
dCol2(X) ← succ(Y,X)

— hasPred says “each nonzero constant in succ has a predecessor”:

hasPred : ∀X(dsucc(X) ⇒ (zero(X) ∨ dP(X)))

— sameDom says “the constants used in succ and the timestamps in config
are the same set”:

sameDom : ∀X(dsucc(X) ⇒ dT(X)) ∧ ∀Y(dT(Y) ⇒ dsucc(Y))

— goodzero says “the zero occurs in succ”:

goodzero : ∀X(zero(X) ⇒ dsucc(X))

— nempty: each of the domains and unary base relations is not empty

nempty : ∃X(dsucc(X))

— Check that each constant in succ has at most one successor and at most one
predecessor and that it has no cycles of length 1.

bad ← succ(X, Y), succ(X, Z), Y ≠ Z
bad ← succ(Y, X), succ(Z, X), Y ≠ Z
bad ← succ(X, X)

Note that the first two of these rules could be enforced by the functional
dependencies X → Y and Y → X on succ.

— Check that every constant in the chain in succ which isn’t the last one
has a successor

hassuccnext : ∀Y(dCol2(Y) ⇒ (last(Y) ∨ dCol1(Y)))

— Check that the last constant has no successor and zero (the first constant)
has no predecessor.

bad ← last(X), succ(X, A)
bad ← zero(X), succ(A, X)

— Check that every constant eligible to be in last and zero must be so.

eligiblezero : ∀Y(dCol1(Y) ⇒ (dCol2(Y) ∨ zero(Y)))
eligiblelast : ∀Y(dCol2(Y) ⇒ (dCol1(Y) ∨ last(Y)))
— Each S_i and zero and last contain ≤ 1 element.

bad ← S_i(X), S_i(Y), X ≠ Y
bad ← zero(X), zero(Y), X ≠ Y
bad ← last(X), last(Y), X ≠ Y

— Check that S_i, S_j, last, zero are disjoint (0 < i < j ≤ h):

bad ← zero(X), last(X)
bad ← S_i(X), S_j(X)
bad ← zero(X), S_i(X)
bad ← last(X), S_i(X)

— Check that the timestamp is the key for config. There are three rules, one
for the state and two for the two counters; the one for the state is:

bad ← config(T, S, C1, C2), config(T, S', C1', C2'), S ≠ S'

— Check the configuration of the 2CM at time zero: config must have a tuple
at (0, 0, 0, 0), and there must not be any tuples in config with a zero state
and non-zero times or counters.

(S) ^ zero{T) ,config{T, S, _) 

{C)^zero{T),config{T, C, .) 
iC)^zero{T),confzg{T, C) 

Vy^ {T)‘^zero{S),config{T, S, _) 

^!/ci {Ci)^zero{S),config{T, S, Ci,f) 

Vy^^ {C2) ^ zero{S),config(T, S, C2) 
goodconfigzero : So(Jf)A 

(X) V (X) V Vy^ (X) V (X) V (X)) ^ zero{X)) 

— For each tuple in config at time T which isn’t the halt state, there must
also be a tuple at time T + 1 in config.

V1(T) ← config(T, S, C1, C2), S_h(S)
V2(T) ← succ(T, T2), config(T2, S', C1', C2')
hasconfignext : ∀T((dT(T) ∧ ¬V1(T)) ⇒ V2(T))

— Check that the transitions of the 2CM are followed. For each transition
δ(j, >, =) = (k, pop, push), we include three rules, one for checking the state,
one for checking the first counter and one for checking the second counter.
For the transition in question we have for the state

V_δ(T') ← config(T, S, C1, C2), succ(T, T'), S_j(S), succ(X, C1), zero(C2)
V_δs(S) ← V_δ(T), config(T, S, C1, C2)
goodstate_δ : ∀S(V_δs(S) ⇒ S_k(S))

and for the first counter, we (i) find all the times where the transition is
definitely correct for the first counter

Q1_δ(T') ← config(T, S, C1, C2), succ(T, T'), S_j(S), succ(X, C1),
           zero(C2), succ(C1', C1), config(T', S', C1', C2')

(ii) find all the times where the transition may or may not be correct for the
first counter

Q2_δ(T') ← config(T, S, C1, C2), succ(T, T'), S_j(S), succ(X, C1), zero(C2)

and make sure Q1_δ and Q2_δ are the same

goodtrans_δc1 : ∀T(Q1_δ(T) ⇔ Q2_δ(T))

Rules for the second counter are similar.

For transitions δ_1, δ_2, ..., δ_k, the combination can be expressed thus:

goodstate : goodstate_δ1 ∧ goodstate_δ2 ∧ ... ∧ goodstate_δk
goodtrans_c1 : goodtrans_δ1c1 ∧ goodtrans_δ2c1 ∧ ... ∧ goodtrans_δkc1
goodtrans_c2 : goodtrans_δ1c2 ∧ goodtrans_δ2c2 ∧ ... ∧ goodtrans_δkc2

— Check that the halting state is in config.

hlt(T) ← config(T, S, C1, C2), S_h(S)
halt : ∃X hlt(X)

Given these views, we claim that satisfiability is undecidable for the query

ψ = ¬bad ∧ hasPred ∧ sameDom ∧ halt ∧ goodzero ∧ goodconfigzero ∧ nempty ∧
hassuccnext ∧ eligiblezero ∧ eligiblelast ∧ goodstate ∧ goodtrans_c1 ∧ goodtrans_c2 ∧
hasconfignext ■
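
As an informal illustration of how some of these views constrain a database instance, the sketch below evaluates the succ-related bad rules and the formula hasPred on a small instance. The function name succ_chain_ok and the set-based encoding of the relations are our own assumptions; the sketch covers only a fragment of ψ and plays no part in the proof.

```python
def succ_chain_ok(succ, zero, last):
    """Return True if no succ-related `bad` rule fires and hasPred holds.

    succ: iterable of pairs (x, y) meaning succ(x, y)
    zero, last: sets of constants in the unary relations zero and last
    """
    succ = set(succ)
    bad = (
        # a constant with two different successors
        any(x == x2 and y != z for (x, y) in succ for (x2, z) in succ)
        # a constant with two different predecessors
        or any(x == x2 and y != z for (y, x) in succ for (z, x2) in succ)
        # a cycle of length 1
        or any(x == y for (x, y) in succ)
        # the last constant has a successor
        or any(x in last for (x, _) in succ)
        # zero has a predecessor
        or any(x in zero for (_, x) in succ)
    )
    dsucc = {c for pair in succ for c in pair}   # all constants in succ
    dP = {x for (_, x) in succ}                  # constants with a predecessor
    has_pred = all(x in zero or x in dP for x in dsucc)
    return (not bad) and has_pred
```

Note that an instance containing the chain 0, 1, 2 together with an unrelated cycle passes these checks, in line with the remark that succ may contain unrelated cycles.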
Abstract. Yao’s formula is one of the basic tools in any situation where
one wants to estimate the number of blocks to be read in answer to some
query. We show that such situations can be modelled by probabilistic
urn models. This allows us to fully characterize the probability distribution
of the number of selected blocks under uniformity assumptions,
and to consider extensions to non-uniform block probabilities. We also
obtain a computationally efficient approximation of Yao’s formula.



1 Introduction 

Evaluating the performance of a database is an intricate task, with many
intermediate computations. One frequent step consists in evaluating the number
of memory blocks that must be read in order to obtain a specified subset of some
set, for example by a selection. A first answer was given by Yao, at least
for the expectation of the number of selected blocks, and under uniformity and
independence assumptions.

Yao’s mathematical treatment was based on a very simple expression for
the number of ways of choosing i objects among j; this allowed him to obtain
the average number of desired blocks, but he did not characterize the
probability distribution any further. Getting enough information on the probability
distribution is important, because the mean value is not enough to characterize a
random variable. Substituting the mean value for the random variable itself
when computing query costs certainly allows fast evaluation in a situation where
the accent is not on the detailed study of the number of retrieved blocks, but
rather on quick computations for choosing an execution plan. At the same time,
one must have some confidence that using this average value will not induce too
much error and lead us to choose a wrong plan!

Extending Yao’s approach to try and get more information on the distribution,
such as its higher order moments, would quickly lead to intricate computations,
which may finally succeed in giving a mathematical expression in the uniform
case, but which would probably not express the underlying mathematical
phenomenon in an intuitive, easy to understand form, and which would fail when
considering non-uniform distributions on blocks. We shall see that, when using
the right framework (random allocations and probabilistic urn models), it is easy
to obtain all the desired information on the random variable number of retrieved
blocks; as a consequence, we can check that in many situations the random
variable is quite close to its expectation, which justifies the use of this expectation
instead of the random variable itself. More precisely, we give the variance of
the number of selected blocks, and we can obtain higher order moments, if
desired. However, this seems less useful than getting the approximate distribution
for large numbers of blocks m and of selected objects n: this distribution is
Gaussian, and its mean and variance are very easy to compute.¹

Catriel Beeri, Peter Buneman (Eds.): ICDT’99, LNCS 1540, pp. 100-^^] 1998.
© Springer-Verlag Berlin Heidelberg 1998

Our first mathematical tool is a random allocation model that is an extension
of a well-known occupancy urn model: when allocating a given number of balls
into a sequence of urns, what can we say about the random variable number of
empty urns? Our second tool is a systematic use of the methods of the analysis of
algorithms, i.e. generating functions and asymptotic approximations by complex
analysis.

Numerical computations show that our approximation of the expectation is both
very close to the exact value and much quicker to compute than Yao’s formula:
in many situations it is not worthwhile to use the exact formula, which differs
from our approximation by a negligible amount, and which requires a significant
computation time. We compute our approximation of Yao’s formula with a
constant number of operations, while the exact formula requires O(n) operations.
Another interesting consequence is that our mathematical model easily lends
itself to extensions such as non-uniform distributions. We consider in this paper
piecewise distributions, and show that we can extend our results to obtain again
an asymptotic Gaussian limiting distribution, whose mean and variance can be
efficiently computed.

The plan of the paper is as follows. We present in Section 2 our urn model
for the classical problem (uniform placement of objects in blocks), then study
the distribution of the number of selected blocks. We discuss in Section 3 the
adequacy of an extension of Yao’s formula (presented in mn) to the case
where the probabilities of blocks are no longer uniform. This leads us to an
extension to piecewise distributions in Section 4. We also give in this section a
sketch of our proofs; we refer the interested reader to ITTN^ for detailed proofs.
We conclude by a discussion of our results and of possible extensions in Section 5.
A preliminary version of these results is presented in \rm\ .

2 Yao’s Formula Revisited 

2.1 Notations and Former Result. 

Consider a set ℰ of objects, whose representation in memory requires a specified
number of pages. A query on the set ℰ selects a subset ℱ of ℰ; now assume that
the cardinality of the answer set ℱ is known; how many pages must be read to
access all the objects of ℱ? It is usual to assume that placement of the objects of
ℰ is done randomly and uniformly: each object of ℰ has the same probability
to be on any page. We shall use the following notations:

— the total number of objects is p; these p objects are on m memory pages;
each block contains b = p/m objects (in this section we assume that all the
blocks have the same capacity);

— the query selects n objects;

— the number of blocks containing the n objects selected by the query is a
random variable X with integer values.

¹ We recall that a Gaussian distribution is fully determined by its first two moments,
i.e. by its expectation and its variance.

Yao gave the expectation of X by computing the probability that a
given page is not selected: this probability is equal to the number of allocations
of the n selected objects among m − 1 pages, divided by the total number of
configurations, i.e. by the number of allocations of n objects among m pages;
hence

\[ E[X] = m \left( 1 - \binom{p-b}{n} \Big/ \binom{p}{n} \right). \qquad (1) \]



2.2 Occupancy Urn Models. 

This familiar problem lends itself to a detailed probabilistic analysis, based on
a random allocation model that is a variation of one of the basic urn occupancy
models. We refer the reader to the survey book of Johnson and Kotz [JK] for a
general presentation of urn models and give directly the urn model that we shall
use:

Take a sequence of m urns and a set of n balls, then allocate an urn to
each ball. The trials are independent of each other; at each trial all urns
have the same probability 1/m to receive the ball. When the n balls are
allocated, define the random variable X as the number of empty urns.

The translation to our database problem should now be obvious: the m urns
are the pages; the n balls are the selected objects; the empty urns are the pages
that contain no selected object, i.e. that won’t be accessed. A few years before
Yao, Cardenas [Ca] gave a formula for the expectation of X that is a direct
application of the classical urn model:

\[ E[X] = m \left( 1 - \left( 1 - \frac{1}{m} \right)^n \right). \qquad (2) \]
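
The two expectations are easy to compare numerically. The sketch below assumes that Yao’s expectation is written in its usual binomial form m(1 − C(p−b, n)/C(p, n)) with p = mb, and uses Cardenas’ formula (2) for pages of unbounded capacity; the function names are ours and purely illustrative.

```python
from math import comb


def yao_expected_blocks(m, b, n):
    """Yao's expectation for m pages of capacity b and n selected objects."""
    p = m * b                                   # total number of objects
    return m * (1 - comb(p - b, n) / comb(p, n))


def cardenas_expected_blocks(m, n):
    """Cardenas' expectation, assuming pages of unbounded capacity."""
    return m * (1 - (1 - 1 / m) ** n)
```

For m = 500, b = 2 and n = 200, the first function gives about 180.08 and the second about 164.97, so the unbounded model underestimates the number of pages when b is small.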

Much is known for this model: exact formulae for the moments of X and for the
probability distribution, asymptotic normality and convergence towards a Gaussian
process in what Kolchin et al. [K^ call the central domain, where the ratio
n/m is constant, or at least belongs to an interval [a1, a2], 0 < a1 < a2 < +∞.
However, there is a difference between the database situation we want to model
and the urn model: we have imposed no limitation on urn capacities, while
pages have a finite capacity b. Assuming that pages have unbounded capacity is
not realistic from a database point of view, and we turn now to an extension of
the model where each urn can receive no more than b balls.

What happens to the assumptions of independence and uniformity when
considering bounded urns? It seems obvious that, if urns have a finite number b of cells,
each of which can receive exactly one ball, and if at some point an urn U1 is still
empty while an urn U2 has received one or more balls, the probability that the
next ball will be allocated to U1 is no longer equal to the probability that it will
be allocated to U2. However, we may still consider that, at any time, the empty
cells have the same probability to receive the next ball; hence the probability
of any urn at that point is proportional to the number of its empty cells. Now
the independence assumption becomes: the allocation of any ball into a cell is
independent of former allocations. In other words, independence and uniformity
still hold, but for cells instead of urns. Returning to our database formulation,
a cell is one of the objects of the set ℰ, and the objects that are selected for
the query are chosen independently of each other, and independently of their
placement on pages, which is exactly the usual assumption.
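
The bounded-capacity model can be checked by simulation: choosing n distinct cells uniformly among the mb cells is the same as allocating n balls so that every empty cell is equally likely at each step. The following Monte Carlo sketch rests on that reading; the function name, the default number of trials and the seed are arbitrary choices of ours.

```python
import random


def simulate_selected_blocks(m, b, n, trials=2000, seed=0):
    """Estimate the mean number of non-empty urns in the bounded model."""
    rng = random.Random(seed)
    p = m * b                                    # total number of cells
    total = 0
    for _ in range(trials):
        cells = rng.sample(range(p), n)          # n distinct cells, uniformly
        total += len({c // b for c in cells})    # cell c belongs to urn c // b
    return total / trials
```

With m = 50, b = 2 and n = 20 the estimate should be close to 18.08, the value given by the binomial form of Yao’s expectation for these parameters.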

2.3 Analysis of the Bounded-Capacity Urn Model.

Let X be the random variable number of selected blocks, or of non-empty urns.
We now introduce the generating function enumerating the set of possible
allocations of balls into urns, which will give us a tool for studying the
probability distribution of X. Define $N_{l,n}$ as the number of allocations of n balls
into l of the m urns, in such a way that none of the l urns is empty, and define
$F(x,y) = \sum_{l,n} N_{l,n} x^l y^n / n!$: we use the variable x to “mark” the non-empty
urns, and y to mark the balls. The probability that an allocation gives l
urns with at least one ball, conditioned by the number n of balls, the average
number of non-empty urns, and its variance all can be obtained by extracting
suitable coefficients of the function F(x,y) or of some of its derivatives. We
compute F(x,y) by symbolic methods, and obtain, when desired, approximate
expressions of coefficients by asymptotic analysis; see for example ESI for an
introduction to these tools. We have that

\[ F(x,y) = \left( 1 + x \left( (1+y)^b - 1 \right) \right)^m. \qquad (3) \]
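
Coefficient extraction from F can be carried out mechanically for small parameters. Expanding (3) by the binomial theorem gives [x^l y^n] F = C(m, l) [y^n] ((1+y)^b − 1)^l, and dividing by C(mb, n), the coefficient of y^n in F(1, y), yields the probability of l non-empty urns. The sketch below follows this route; the helper names are hypothetical and the normalisation by C(mb, n) is our reading of the model.

```python
from math import comb


def poly_mul(a, b, max_deg):
    """Multiply two polynomials (coefficient lists), truncated at max_deg."""
    out = [0] * (max_deg + 1)
    for i, ai in enumerate(a):
        if ai == 0:
            continue
        for j, bj in enumerate(b):
            if i + j > max_deg:
                break
            out[i + j] += ai * bj
    return out


def block_distribution(m, b, n):
    """Return [Pr(X = l | n) for l = 0..m], X being the number of non-empty urns."""
    base = [0] + [comb(b, k) for k in range(1, b + 1)]   # (1 + y)^b - 1
    total = comb(m * b, n)                               # [y^n] F(1, y)
    probs = []
    power = [1]                                          # ((1 + y)^b - 1)^0
    for l in range(m + 1):
        coeff = power[n] if n < len(power) else 0        # [y^n] of the l-th power
        probs.append(comb(m, l) * coeff / total)
        power = poly_mul(power, base, n)
    return probs
```

For m = 5, b = 2 and n = 3 this gives probability 1/3 for two non-empty urns and 2/3 for three, whose mean 8/3 agrees with the binomial form of Yao’s expectation.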

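The probability that l blocks are selected, the average number of selected blocks
and its variance are obtained from the coefficients of F(x,y) and of its
derivatives:

\[ \Pr(l \mid n) = \frac{[x^l y^n] F(x,y)}{\binom{p}{n}}, \qquad (4) \]

\[ E[X] = m \left( 1 - \binom{p-b}{n} \Big/ \binom{p}{n} \right), \qquad (5) \]

\[ \sigma^2[X] = m(m-1) \binom{p-2b}{n} \Big/ \binom{p}{n} + m \binom{p-b}{n} \Big/ \binom{p}{n} - m^2 \left( \binom{p-b}{n} \Big/ \binom{p}{n} \right)^2. \qquad (6) \]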

For large m and n, the formulae above have an approximate expression in terms
of the ratio n/m. Although it is possible to consider an arbitrary relationship
between the orders of growth of n and m, we shall limit ourselves to the case
where n/m is constant; in other words we are in the central domain of Kolchin
et al. We give below approximate values for the moments E[X] and σ²[X]; we
shall see later on (see Section 2.4) that these approximations are very close to
the actual values.

\[ E[X] \sim m \left( 1 - \left( 1 - \frac{n}{bm} \right)^b \right), \qquad (7) \]

\[ \sigma^2[X] \sim m \left( 1 - \frac{n}{bm} \right)^b \left( 1 - \left( 1 + \frac{(b-1)n}{bm} \right) \left( 1 - \frac{n}{bm} \right)^{b-1} \right). \qquad (8) \]

What about the limiting distribution in the central domain? Applying former
results insj, we see that the limiting distribution of the random variable X exists,
and is a Gaussian distribution whose mean and variance satisfy the formulae (7)
and (8).



2.4 Numerical Applications. 

We first evaluate the possibility of using the approximate formula (7) instead of
Yao’s exact formula, whose computation is much more complicated and requires
a time that cannot be neglected. We have fixed the total number of blocks at
m = 500, and varied the size b of the blocks. For comparison purposes, we have
also indicated the values corresponding to the infinite case, which give a lower
bound for the expected number of blocks. Table 1 gives the numerical results,
which are also the basis for the plots of Figure 1. The exact and approximate
values are very close to each other; as a consequence, in the sequel we shall no
longer bother with the exact values, but rather use the approximate expressions.
However, as was already noticed by Yao, the infinite urn case is not a good
approximation, at least for low sizes b of blocks.

Our next objective was to check the quality of the Gaussian approximation. We
give in Figure 2 the exact distribution of the number of selected blocks, as well
as the Gaussian distribution with the same mean and variance. As the former
values of m and n would have given us some extremely low probabilities, we have
chosen here m = 100 and b = 5, with a total of n = 200 balls. For these rather
small values of m and n, we can check that the approximation by a Gaussian
distribution is already quite good; of course it is still better for larger values of
the parameters m and n.
               Yao (exact)              Yao (approximate)       Unbounded urns
   n       b=2     b=5     b=10      b=2     b=5     b=10       (Cardenas)
  200     180.0   170.5   167.7      180    170.4   167.5         164.9
  400     320.1   291.0   282.9      320    290.8   282.8         275.5
  600     420.1   373.3   360.9      420    373.2   360.7         349.5
  800     480.0   427.4   412.7      480    427.3   412.5         399.2
 1000     500     461.2   446.4      500    461.1   446.3         432.4

Table 1. Number of selected blocks, for a uniform distribution and several sizes
of blocks



We can now be reasonably sure that the random variable number of selected
blocks behaves as a Gaussian distribution, which has some important
consequences: for a Gaussian distribution, we can quantify precisely the probability
that we are at a given distance from the average value E, using the variance.
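
For a Gaussian random variable this quantification is a one-line computation with the error function. The sketch below is generic: the function name is ours, and the mean and variance are taken as inputs rather than computed from any particular formula of this paper.

```python
from math import erf, sqrt


def prob_within(variance, distance):
    """P(|X - E| <= distance) for a Gaussian X with mean E and the given variance."""
    return erf(distance / sqrt(2 * variance))
```

For instance, the probability of lying within one standard deviation of the mean is about 0.68, and within 1.96 standard deviations about 0.95.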

It is worth noting that our asymptotic formulae make for very quick
computations, in contrast to Yao’s formula. For n = 1000 the computation of Yao’s
formula requires 2.5s, the bound obtained by assuming that the urns are infinite
requires 0.15s, and the asymptotic formula is obtained in at most 0.017s. Our formula
needs a constant number of operations when n and m increase, whereas Yao’s formula
requires n divisions and n − 1 multiplications for the same computation.



3 How Well-Fitted to Object Databases Is Yao’s 
Formula? 

It may happen that the database has different block sizes, or that the objects 
in the database are clustered according to the values of some attribute; then 
the probability that a block is selected is no longer uniform, and Yao’s formula 
no longer gives a valid estimation of the average number of selected blocks. 
An extension of Yao’s formula to deal with such a situation was proposed by 
Gardarin et al. in [GGT], where they consider blocks of different sizes in an
object-oriented database, and suggest that Yao’s formula be applied on each 
group of equal-sized blocks: 

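Let C be a collection of j partitions, where each partition has $p_i$ objects and
requires $m_i$ blocks. Assume that n objects are chosen at random in the collection
C; then the average number of selected blocks is

    $\sum_{i=1}^{j} \mathrm{Yao}(n_i, p_i, m_i)$, with $n_i = n\,p_i / p$    (9)

where $\mathrm{Yao}(n_i, p_i, m_i)$ stands for Yao's formula applied to $n_i$ objects selected
among the $p_i$ objects of the i-th partition, stored in $m_i$ blocks.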
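
As an illustration, the per-partition estimate can be sketched in Python as follows. This is our reading of the formula (a sum of Yao's formula over the partitions, with $n_i = n\,p_i/p$), not code from Gardarin et al.; the function names are ours, and the log-gamma function is used so that a non-integer $n_i$ is accepted.

```python
import math

def yao(n, p, m):
    """Yao's formula for n objects selected among p objects stored in m
    equal-sized blocks; lgamma allows a non-integer n."""
    b = p / m
    if n > p - b:
        return float(m)  # too many objects selected for any block to be missed
    log_ratio = (math.lgamma(p - b + 1) + math.lgamma(p - n + 1)
                 - math.lgamma(p - b - n + 1) - math.lgamma(p + 1))
    return m * (1.0 - math.exp(log_ratio))

def partitioned_estimate(n, partitions):
    """partitions: list of (p_i, m_i); applies Yao's formula to each
    partition with n_i = n * p_i / p and sums the results."""
    p = sum(p_i for p_i, _ in partitions)
    return sum(yao(n * p_i / p, p_i, m_i) for p_i, m_i in partitions)

if __name__ == "__main__":
    # 100 blocks of 10 objects and 200 blocks of 15 objects
    print(round(partitioned_estimate(200, [(1000, 100), (3000, 200)]), 1))
```

With these two partitions and n = 200, the sketch gives a value close to 147.8.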

The proof of this formula relies on the assumption that the placement of objects
into blocks is uniform, and that each object has the same probability of being
selected, independently of the others; as a consequence the number of selected
objects in the i-th partition is $n\,p_i/p$. We argue that this does not always hold,
as the simple example below shows.

Fig. 1. Number of selected blocks, plotted against the number of selected objects,
for several sizes of blocks

Consider a database containing information relative to persons, and the query Q
“Find all people having a salary at least equal to 1000 $”. If we know the number
n of persons satisfying this query, and if the placement of persons on pages
(blocks) is random, then the average number of blocks to retrieve is given by
Yao’s formula. Now assume that the database also contains data relative to cars,
that each car has an owner, and that the persons are partitioned according to
whether or not they own a car: some blocks will contain data relative to the
persons who own a car, and to their cars, and other blocks will contain data
relative to the persons who do not own a car. Then we have introduced a correlation
between the salary and ownership of a car: we expect the proportion of people
satisfying our query Q to be larger among car owners than among the other
people. This means that the blocks containing data on car owners have a greater
probability of being selected than the other blocks.

The formula (9) does not take this phenomenon into account, and may lead to
biased results, all the more so when clustering is done according to an attribute
that appears in the query. In the following section, we consider the case where
the probability that a block is selected can take a finite number of values, i.e.
where we can partition the blocks according to their probability.

Fig. 2. Probability distribution of the number of selected blocks and of the
Gaussian distribution with same mean and variance

4 Piecewise Uniform Distributions 

We now define a model to deal with non-uniformity of database objects or of
blocks. We shall assume that the probability distribution on the blocks is
piecewise uniform, and that the objects are stored in blocks that may have different
capacities. This leads us to extend our urn model to allow different probabilities
for urns. On each trial, the ball falls into the urn U with a probability Pr(U).
We should mention at once that this assumption does not contradict our former
assumption that urns have a finite capacity: the conditional probability that an
urn U receives the next ball, knowing that it already holds j balls, is of course
related to j!

We assume that there are j distinct kinds of blocks, indexed by i ($1 \le i \le j$);
there are $m_i$ blocks of the kind i, each of which holds $b_i$ objects. We translate
this into an urn model, where the urns of type i each have $b_i$ cells, and where
each cell can hold at most one ball.

The total number of blocks (urns) is $m = \sum_{1 \le i \le j} m_i$; the cumulated number
of objects (balls) in blocks of type i is $m_i b_i$, and the total number of objects in
the database is $p = \sum_{1 \le i \le j} m_i b_i$. We shall denote by $\pi_i$ the probability that a
selected object is stored in a block of type i; of course $\sum_{1 \le i \le j} \pi_i = 1$. This
can be illustrated using the example of Section 3, where we separate the persons
owning a car from those who don’t: $\pi_1$ would be the probability of having a car,
knowing that the person earns more than 1000 $:

    $\pi_1 = \Pr(\text{Car} \mid \text{Sal} > 1000\,\$)$ and $\pi_2 = \Pr(\neg\text{Car} \mid \text{Sal} > 1000\,\$) = 1 - \pi_1$.

4.1 Theoretical Results.



In this part, we characterize the probability distribution of the random variable
X, the number of selected blocks, conditioned by the number n of selected objects;
a brief sketch of the mathematical proofs can be found in Section 4.3.

When we consider two types of blocks (j = 2), we have exact expressions for the 
mean and variance of X : 



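Proposition 4.1 Let

    [equation (10), defining $f(m_1, m_2)$, is illegible in the source]    (10)

The average number of selected blocks, conditioned by the number n of selected
objects, is

    $E[X] = m - \left( m_1 \frac{f(m_1 - 1, m_2)}{f(m_1, m_2)} + m_2 \frac{f(m_1, m_2 - 1)}{f(m_1, m_2)} \right)$    (11)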

This formula can be extended to give the variance, and to include general 
j [Gl\l2j , but the usefulness of such an extension may be discussed : We have seen 
that exact computation of Yao’s formula, which involves a binomial coefficient, 
is already costly, and we have here sums of products of such terms! We shall 
rather concentrate our efforts on obtaining asymptotic expressions, which can 
be computed with a reasonable amount of effort, and which give good accuracy 
in many cases, as we shall see in Section 4.2.

We now turn to the asymptotic study, and notice at once that, in contrast to
the uniform case where we had to deal with two variables n and m, we now have
to deal with j + 1 variables: the number n of selected objects and the numbers
$m_i$ of blocks of type i, for the j types. These variables may grow at different
rates; moreover we have 2j parameters $\pi_i$ and $b_i$. We shall limit ourselves to the
easiest generalization of the uniform case: the number $m_i$ of blocks of each type
is proportional to n. Under this assumption, we have the following result:



Theorem 4.1 When the number of blocks of each type is proportional to the
total number n of selected objects, the random variable X asymptotically follows
a Gaussian limiting distribution, with mean and variance

    $E[X] \sim \sum_{i=1}^{j} m_i \left( 1 - \frac{1}{(1 + \pi_i \rho / (b_i m_i))^{b_i}} \right)$    (12)

    [equation (13), the asymptotic variance of X, is illegible in the source]    (13)

In these formulae, $\rho$ is a function of n, and is the unique positive solution of the
equation in y: $\sum_{i=1}^{j} \pi_i y / (1 + \pi_i y / (b_i m_i)) = n$.
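
The asymptotic mean is cheap to evaluate numerically. The Python sketch below rests on our reading of the theorem, which is an assumption: the mean is taken to be $\sum_i m_i (1 - (1 + \pi_i \rho/(b_i m_i))^{-b_i})$, and $\rho$ the positive root of $\sum_i \pi_i y/(1 + \pi_i y/(b_i m_i)) = n$. This reading agrees with the approximate columns of Table 2. The function names and the choice of bisection are ours.

```python
def solve_rho(n, blocks, iterations=200):
    """blocks: list of (m_i, b_i, pi_i) with all pi_i > 0. Returns the positive
    root y of sum_i pi_i*y / (1 + pi_i*y/(b_i*m_i)) = n, found by bisection."""
    capacity = sum(m * b for m, b, _ in blocks)
    if not 0 < n < capacity:
        raise ValueError("need 0 < n < total number of objects")

    def g(y):
        return sum(pi * y / (1.0 + pi * y / (b * m)) for m, b, pi in blocks) - n

    lo, hi = 0.0, 1.0
    while g(hi) < 0.0:  # g is increasing in y, so doubling brackets the root
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def asymptotic_mean(n, blocks):
    """Assumed asymptotic mean: sum_i m_i * (1 - (1 + pi_i*rho/(b_i*m_i))**(-b_i))."""
    rho = solve_rho(n, blocks)
    return sum(m * (1.0 - (1.0 + pi * rho / (b * m)) ** (-b)) for m, b, pi in blocks)

if __name__ == "__main__":
    # Two types of blocks: m_1 = 100, b_1 = 10 and m_2 = 200, b_2 = 15
    for pi1 in (0.25, 0.8, 0.9):
        blocks = [(100, 10, pi1), (200, 15, 1.0 - pi1)]
        print(pi1, round(asymptotic_mean(200, blocks), 1))
```

For a single type of blocks the root is available in closed form and the mean reduces to m(1 - (1 - n/p)^b).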



4.2 Numerical Applications. 

We have first considered two types of urns (j = 2), with $m_1 = 100$ and $b_1 = 10$
for the first type, $m_2 = 200$ and $b_2 = 15$ for the second type. We have also
considered different values for the probabilities $\pi_1$ and $\pi_2$ that an object belongs
to a block of the first or second type. Table 2 gives the average value of X in
several cases: the first three columns give the exact values for different $\pi_1$,
the next three columns the approximate values under the same assumptions,
and the last column is obtained by the formula (9) of Gardarin et al. (which
does not allow us to consider different probabilities $\pi_i$). When the number n
of selected objects is relatively low, the last formula gives results that may be
considered as valid in the first case ($\pi_1 = 1/4$, which matches the share 1000/4000
of the objects held by blocks of the first type), but that differ noticeably from
the actual ones in the other cases, where the probability distribution is more
biased. (For n greater than 1500, almost all blocks are selected anyway, and the
different formulae give comparable results.) Another noteworthy point is that, here
again, the asymptotic approximation is of quite good quality, and that the time
for computing this approximation is far lower than the corresponding time for the
exact case. The time necessary to compute $f(m_1, m_2)$, and then E[X], is O(n) for
the exact formula, and O(1) for the asymptotic approximation. For the
numerical examples we have considered (see Table 2), the computation time of
the asymptotic formulae is at most 0.017s, the formula of Gardarin et al. (which
leads to somewhat imprecise results) requires a few seconds, and the time for
the exact formula can be as much as a few minutes! As a consequence, we have
used the approximate formulae for the curves of Figure 3.a, which presents the
average number of selected blocks for probabilities $\pi_1 = 9/10$ and $\pi_2 = 1/10$;
we have also plotted the estimation from Gardarin et al., which is always larger
than the exact result.



               Exact values                 Approximate values
         π1=1/4   π1=4/5   π1=9/10     π1=1/4   π1=4/5   π1=9/10
   n     π2=3/4   π2=1/5   π2=1/10     π2=3/4   π2=1/5   π2=1/10    Uniformity

  200    147.6    122.2    108.1       147.4    122.1    108.0        147.8
  400    224.1    178.3    148.5       224.0    178.4    148.5        224.3
  600    263.0    218.2    182.7       262.8    218.2    182.7        263.1
  800    282.3    249.7    218.7       282.2    249.6    218.7        282.4
 1000    291.7    272.5    251.8       291.7    272.4    251.8        291.8
 1200    296.2    286.8    276.1       296.2    286.7    276.1        296.3
 1500    298.9    296.6    294.0       298.9    296.6    293.9        298.9
 2000    299.9    299.8    299.7       299.9    299.8    299.7        299.9

Table 2. Average number of selected blocks for two types of blocks with different
probabilities

Fig. 3. a: Average number of selected blocks, for two types of blocks with
probabilities π1 = 9/10 and π2 = 1/10, plotted against the same number under a
uniform distribution. b: Three types of blocks



The average value of X is a simple function of the parameter $\rho$, which is obtained
by solving a polynomial equation of degree j whose coefficients are expressed
with the parameters n, $m_i$, $b_i$ and $\pi_i$. While a comparatively simple expression of
$\rho$ exists when j = 2, such a formula either becomes quite complicated, or does
not exist, for larger j, and we then use a numerical approximation of $\rho$. We
present in Figure 3.b results for three types of blocks, which require a numerical
approximation of the solution $\rho$. Here again, we have plotted the results against
the result from Gardarin et al. in [GGT], and have checked that our formulae give
a lower average number. We can check that the simplification that allowed them
to compute their formula, by making strong assumptions of uniformity, gives a
result that is always greater than the exact one; such a systematic overestimation,
as noted long ago by Christodoulakis, can lead to errors in the further
process of optimization.

4.3 Sketch of Proofs. 

We give here the main ideas and the sketch of the proofs of our theorems; 
the interested reader can find the complete proofs and detailed computations 
in [CH3- We start from the generating function F{x,y), with x marking the 
number of non empty urns, i.e. of selected blocks, and y marking the total 
number of balls in the m urns : 

F{x,y) = '\\{x+ {l + T:^yf' (14) 

i=l 

The appropriate coefficients of the generating function and of its derivatives give
us the desired information on the probability distribution, for example its value at
some point, its average value and its variance. To obtain the limiting distribution,
we use Lévy’s theorem on the convergence of characteristic functions. The
characteristic function of the distribution of X, conditioned by the number n of
selected objects, is obtained from the quotient of the two coefficients $[y^n]F(x, y)$
and $[y^n]F(1, y)$. We approximate these coefficients by a saddle point method,
treating x as a parameter; in a second step we choose $x = e^{it}$ to obtain
the characteristic function, and check that it converges for large n towards the
function of a Gaussian distribution of suitable expectation and variance.

An alternative proof is obtained by applying recent results of Bender and
Richmond to our generating function F(x, y).

5 Discussion and Extensions 

We have shown that an adequate model allows us to compute much more
than the average number of blocks to be retrieved, in a situation where all the
blocks have the same probability of being selected. We have computed the variance
of this number and its limiting distribution. In our opinion, the points that are
relevant to practical applications are the following: the limiting distribution
is Gaussian with mean and variance of the same order, the exact distribution
is close to the limiting one for reasonable values of the parameters, and the
approximate values of the expectation and variance are given by very simple
formulae. This means that, in a situation where one wants to use Yao’s
formula, one can use the average value with confidence that the actual number
is not too far from it, and that one does not even need to use the exact average
value (whose computation, involving binomial coefficients, i.e. factorials, can be
costly), but can use a very efficient approximation.

We have also shown that extensions of Yao’s formula to non-uniform probabilities
are best done by the urn model. For piecewise uniform distributions, we have given
the exact and approximate mean and variance, and proved that the limiting
distribution is again Gaussian: once more, we are in a situation where using
the expectation instead of the random variable is not likely to induce much error.

Former results on non-uniform distributions are few, and are much less precise
than ours. To the best of our knowledge, they have been limited to the
expectation of the number of retrieved blocks: for example, Christodoulakis gave
upper and lower bounds for this expectation. Our approach differs from
the usual one in that we do not assume independence of the variables under
study. Although allowing dependencies may lead to greater generality, we still
have to define and obtain conditional probabilities (the $\pi_i$ of Section 4), which
may limit the use of our results.

The fact that most of our results are asymptotic, i.e. valid for large numbers of
blocks and of selected objects, may at first glance seem a restriction. However,
numerical computations have shown that the exact and approximate values are
very close to each other for parameters as low as a few hundred, which is certainly
relevant to most databases. If this is not the case, then one should use the exact
formulae, which still give feasible computations for low values of the parameters.
If one wishes for a mathematical justification of our approximations, then one
might try and quantify the speed of convergence towards the limiting distribution
(reasonably fast in urn models).

From a mathematical point of view, there is still work to be done: while urn models
with uniform probabilities (and unbounded urns) are quite well-known, to our
knowledge the extensions relative to bounded urns and non-uniform probabilities
are not so well characterized. An interesting problem is to define a class of
probability distributions on the urns that still give rise to a Gaussian behaviour,
and with a variance low enough that the error made when using the
average value instead of the random variable remains within acceptable bounds.
Abstract. A large number of database index structures have been proposed
over the last two decades, and little consensus has emerged regarding
their relative effectiveness. In order to empirically evaluate these
indexes, it is helpful to have methodologies for generating random queries 
for performance testing. In this paper we propose a domain-independent 
approach to the generation of random queries: choose randomly among 
all logically distinct queries. We investigate this idea in the context 
of range queries over 2-dimensional points. We present an algorithm 
that chooses randomly among logically distinct 2-d range queries. It 
has constant-time expected performance over uniformly distributed data, 
and exhibited good performance in experiments over a variety of real and 
synthetic data sets. We observe nonuniformities in the way randomly 
chosen logical 2-d range queries are distributed over a variety of spatial 
properties. This raises questions about the quality of the workloads gen- 
erated from such queries. We contrast our approach with previous work 
that generates workloads of random spatial ranges, and we sketch direc- 
tions for future work on the robust generation of workloads for studying 
index performance. 



1 Introduction 

Multidimensional indexing has been studied extensively over the last 25 years; a
recent survey article describes over 50 alternative index structures for the
two-dimensional case alone. Two-dimensional indexing problems arise frequently,
especially in popular applications such as Geographic Information Systems (GIS)
and Computer Aided Design (CAD). A frustrating aspect of the multidimensional
indexing literature is that among the many proposed techniques, there is
still no “clear winner” even for two-dimensional indexing. Performance studies
that accompany new index proposals typically offer little help, presenting
confusing and sometimes conflicting results. Significantly absent from many of these
studies is a crisp description of the distribution of queries that were used for
testing the index. The need for rigorous empirical performance methodologies in
this domain has been noted with increasing urgency in recent years.



Recent work on generalized indexing schemes presents software and analytic
frameworks for indexing that are domain-independent, i.e., applicable to
arbitrary sets of data and queries. As has been noted, there is a simple logical
characterization of the space of queries supported by an index over a data set
D: they form a set $S \subseteq P(D)$, i.e., a set of subsets of the data being indexed.
Note that this logical view of the query space abstracts away the semantics of
the data domain and considers only the membership of data items in queries:
a query is defined by the set of items it retrieves. This abstraction leads
to simplified systems, frameworks for discussing the hardness of indexing
problems, and domain-independent methodologies for measuring the
performance of queries over indexes.



A natural extension of this idea is to test indexes in a similarly
domain-independent manner, by choosing randomly from the space of logical queries. In
particular, a random logical query is simply a randomly chosen element of the set
$S \subseteq P(D)$ of queries supported by the index. In this paper we consider randomly
generating logical queries in this fashion for indexes that support range queries
over two-dimensional points.



We begin by presenting a simple algorithm for generating random logical 2-d 
range queries, and we study its performance both analytically and empirically. 
While in the worst case the algorithm takes expected time 0{n^) for databases 
of n points, in the case of uniformly distributed point sets it runs in constant 
expected time. We conducted experiments over standard geographic databases, 
and over synthetic data sets of various fractal dimensions: in all these cases, the 
running time of the algorithm was within a factor of two over its expected time 
on uniformly distributed points, suggesting that the algorithm performance is 
efficient and robust in practice. 



We continue by considering the spatial properties of randomly generated 
logical queries. We note that the queries in a randomly generated set of logical 
2-d range queries, although likely to be diverse with respect to the set of points 
they contain, may not be diverse with respect to natural properties such as 
area, cardinality, and aspect ratio. For example, given a data set consisting of 
uniformly distributed 2-d points, the expected area of a random logical range 
query over those points is 36% of the total data space (the unit square), and the 
variance is only 3%. Thus simply testing a data structure using random logical 
range queries will shed little light on how the data structure performs on queries 
of differing area. Moreover, doing so is unlikely to expose the performance of the 
data structure on “typical” queries, which are likely to be more selective. 
To illuminate this issue further, we contrast these observations with properties
of previously-proposed query workloads. These workloads provide explicit
control over domain-specific properties like cardinality and area, but are not 
necessarily diverse with respect to logical properties like the set of points they 
contain, or the cardinality of queries. Thus these workloads may not provide good 
“coverage” of an index’s behavior in different scenarios, and may also not be rep- 
resentative of “typical” real queries either. These observations raise a number 
of research issues, which broadly amount to the following challenge: in order to 
allow experimentalists to do good work analyzing the performance of indexes, we 
need better understanding and control of the techniques for generating synthetic 
workloads. 







Fig. 1. Random Rectangle vs. Random Query Rectangle: the outer and inner 
rectangles represent the same query, though they are distinct rectangles. 



1.1 Related Work 

Previous work on spatial index benchmarking used queries generated from do- 
main-specific distributions, based on geometric properties (area, position, aspect 
ratio, etc.) For example, many studies generated random rectangles in the plane, 
and used these to query the index. Note that the spaces of random rectangles in 
the plane and random queries are quite different: this is illustrated in Figure 1.
From a spatial perspective, $r_1$ and $r_2$ are distinct two-dimensional rectangles.
From a strictly logical perspective, however, $r_1$ and $r_2$ are identical queries,
since they describe the same subset of data. We are careful in this paper to 
distinguish random rectangles in the plane from random query rectangles, which 
are chosen from the space of logically distinct rectangular range queries. It is 
inaccurate to consider random rectangles to be good representatives of randomly 
chosen logical queries; we discuss this issue in more detail in Section El 

The original papers on many of the well-known spatial database indexes use
domain-specific query benchmarks. This includes the papers on R-trees,
R*-trees [2], Segment Indexes, and hB^Π-trees. For some of those papers,
the construction of the random rectangles is not even clear; for example, the
R-tree paper describes generating “search rectangles made up using random
numbers... each retrieving about 5% of the data”, and goes into no further
detail as to how those rectangles are chosen. The hB^Π-tree authors gave some
consideration to the relationship of queries and data set by forming each
rectangle “by taking a randomly chosen existing point as its center”; a similar
technique was used in other work. Domain-specific techniques were also used in
spatial index analysis papers, including Greene’s R-tree performance study, and the
various papers on fractal dimensions (which only used square and radius
query ranges). The domain-specific approach has been studied in greater detail
by Pagel and colleagues. Their work provides an interesting contrast to the work
presented here, and we discuss this in depth in Section |3 

1.2 Structure of the Paper 

In Section |2| we present the algorithm for generating random logical 2-d range 
queries. We also provide analytical results about expected running time over any 
data set, and expected running time over uniformly distributed data. In Section 0 
we present results of a performance study over other data distributions, which 
produces average running times within a factor of 2 of the expected time over 
uniform data. Section 0 considers the spatial properties of randomly chosen range 
queries, and in Section 0 we discuss the properties of rectangles generated by 
prior algorithms that explicitly considered spatial properties. SectionElreflects on 
these results and considers their implications for the practical task of empirically 
analyzing index performance. 




Fig. 2. The minimal bounding rectan- 
gle containing points a, b, c, d and e is 
shown. 



Fig. 3. Both {a,b,c,d} and 
{a, e, c, d} represent this rectangle 



2 An Algorithm for Generating Random 2-d Range
Queries

2.1 Preliminaries 

Definition 1. Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a set of points in the plane.
The minimal bounding rectangle (MBR) containing S is

    $[\min\{x_1, \ldots, x_n\}, \max\{x_1, \ldots, x_n\}] \times [\min\{y_1, \ldots, y_n\}, \max\{y_1, \ldots, y_n\}]$

The rectangle represented by S is the MBR containing S. A query rectangle,
with respect to a data set I, is a rectangle that is represented by a set $S \subseteq I$.
The set of all distinct query rectangles corresponds precisely to the set of logically
distinct rectangular region queries.

Although distinct sets of points can represent the same rectangle, we define 
the canonical set of points representing a rectangle (with respect to a data set) 
as follows. 

Definition 2. Let I be a data set and let q be a rectangle represented by a
subset of I. The canonical top point of q (with respect to I) is the leftmost point
in I lying on the top boundary of q. Similarly, the canonical bottom point is
the leftmost point lying on the bottom boundary. The canonical left and right
points of q are the topmost points lying respectively on the left and right
boundaries of q.

S is the canonical set of points representing q if S consists of the canonical
bottom, top, left, and right points of q.
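
Definitions 1 and 2 translate directly into code. The following Python sketch is only an illustration: points are assumed to be (x, y) tuples, a rectangle is represented as (xmin, xmax, ymin, ymax), and the function names `mbr` and `canonical_set` are ours.

```python
def mbr(points):
    """Minimal bounding rectangle of a non-empty set of (x, y) points,
    returned as (xmin, xmax, ymin, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), max(xs), min(ys), max(ys))

def canonical_set(data, rect):
    """Canonical set of points of the query rectangle rect with respect to
    the data set; assumes rect is the MBR of some subset of data."""
    xmin, xmax, ymin, ymax = rect
    inside = [(x, y) for x, y in data if xmin <= x <= xmax and ymin <= y <= ymax]
    # leftmost point on the top and bottom boundaries
    top = min((pt for pt in inside if pt[1] == ymax), key=lambda pt: pt[0])
    bottom = min((pt for pt in inside if pt[1] == ymin), key=lambda pt: pt[0])
    # topmost point on the left and right boundaries
    left = max((pt for pt in inside if pt[0] == xmin), key=lambda pt: pt[1])
    right = max((pt for pt in inside if pt[0] == xmax), key=lambda pt: pt[1])
    return frozenset((top, bottom, left, right))

if __name__ == "__main__":
    a, b, c, d, e = (0, 2), (2, 4), (4, 1), (1, 0), (2, 2)
    data = [a, b, c, d, e]
    print(mbr(data), sorted(canonical_set(data, mbr(data))))
```

The size of the returned set (between 1 and 4) is what Definition 3 below uses to classify query rectangles.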



Definition 3. There are four types of query rectangles: 1-point, 2-point, 3-point,
and 4-point. An i-point query rectangle is one whose canonical set consists of i
distinct points.

A 4-point rectangle has one data point on each of its boundaries. A 3-point 
rectangle has one data point on a corner of the rectangle, and one point on each 
of the two sides that do not meet at this corner. A 2-point rectangle is either 
a line segment (a degenerate rectangle with zero height or width) with a data 
point on each end, or a non-degenerate rectangle with data points on 2 opposite 
corners. A 1-point rectangle is a single point (a degenerate rectangle, with no 
height or width). 

Definition 4. For data set I, the logical distribution on rectangular queries is 
the uniform distribution on the set of all distinct query rectangles represented 
by subsets of I. A random query rectangle is a query rectangle that is generated 
according to the logical distribution on rectangular queries. 



2.2 Approaches to Generating Random Query Rectangles 

We consider the problem of generating random query rectangles. It is easy to
generate a random rectangular region $[x_1, x_2] \times [y_1, y_2]$, chosen uniformly from
the space of all such regions. Generating a random query rectangle is not as easy.
Some of the most natural approaches to generating random query rectangles do
not work, or are impractical. Consider for example the idea of generating query
rectangles by choosing four data points at random, and taking the minimal
bounding rectangle containing those points. Under this scheme, a rectangle which
has one data point on each of its four sides could only be generated by choosing
precisely those four points. In contrast, a rectangle with two data points at
opposite corners, and many data points in its interior, could be generated by
choosing the two corner points, and any two points from its interior. As a result,
this scheme will be biased towards the latter type of rectangle. Other techniques,
such as “growing” or “shrinking” random rectangles until a query rectangle is
achieved, have similar biases.

A naive approach to avoiding such bias is to generate all query rectangles
and pick one uniformly from among them. However, since there can be $\Theta(n^4)$
query rectangles, this approach is grossly impractical for sizable databases. A
somewhat less naive method, which uses a range tree, correctly generates random 
query rectangles, but requires 0(n^ logn) preprocessing time for a preprocessing 
stage, and 0(log n) time to generate each query following the preprocessing stage. 
The method uses 0{n'^ logn) storage (we omit the details of this method due to 
space limitations). Since n is typically large, this is still impractical.
It is an interesting open question whether there
is a scheme for generating random query rectangles that in the worst case uses 
O(nlogn) space, preprocessing time O(nlogn), and time O(logn) to generate 
each random query rectangle following the preprocessing stage. The method we 
present in the next section does not achieve these bounds in the worst case, but 
it does work quite well in practice. 

2.3 The Algorithm 

We present a simple Las Vegas algorithm for generating random query rectangles. 
The amount of time it takes to run can vary (because of randomness in the 
algorithm), but when it terminates it produces a correct output — a random 
query rectangle. 

Our algorithm for generating a random query rectangle first chooses, with 
an appropriate bias, whether to generate a random 1-point, 2-point, 3-point, or 
4-point rectangle. It then generates a rectangle of the appropriate type, chosen 
uniformly from all query rectangles of that type. We present the pseudocode in
Figure 4. We call an iteration of the repeat loop in the algorithm a trial. We say
that a trial is successful if the set Z generated in that trial is output.



Repeat until halted:

1. Generate a number i between 1 and 4, according to the following probability
   distribution:

   (a) Let $t = \binom{n}{1} + \binom{n}{2} + \binom{n}{3} + \binom{n}{4}$. For $j \in \{1, \ldots, 4\}$, $\mathrm{Prob}[i = j] = \binom{n}{j} / t$.

2. (a) Generate a set Z containing i points of I uniformly at random from among all
       i-point sets in I.

   (b) If i = 1 or i = 2, output Z.

   (c) If i = 3 or i = 4, compute the MBR R containing Z, and check whether Z is the
       canonical set of points for R. If so, output Z.

Fig. 4. Las Vegas Algorithm for Generating Queries from the Logical Distribution
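
The steps of Figure 4 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the authors' implementation: the data set is a list of distinct (x, y) tuples, the canonical-set check is done by a linear scan rather than through an index, and the helper names are ours. The sketch applies the check of step 2(c) to every trial; for i = 1 or i = 2 Figure 4 skips it, since according to the proof of Proposition 1 it always succeeds there.

```python
import random
from math import comb

def mbr(points):
    """Minimal bounding rectangle as (xmin, xmax, ymin, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), max(xs), min(ys), max(ys))

def canonical_set(data, rect):
    """Canonical top, bottom, left and right points of a query rectangle."""
    xmin, xmax, ymin, ymax = rect
    inside = [(x, y) for x, y in data if xmin <= x <= xmax and ymin <= y <= ymax]
    top = min((pt for pt in inside if pt[1] == ymax), key=lambda pt: pt[0])
    bottom = min((pt for pt in inside if pt[1] == ymin), key=lambda pt: pt[0])
    left = max((pt for pt in inside if pt[0] == xmin), key=lambda pt: pt[1])
    right = max((pt for pt in inside if pt[0] == xmax), key=lambda pt: pt[1])
    return frozenset((top, bottom, left, right))

def random_query_rectangle(data, rng=random):
    """Runs trials until one succeeds; returns (rectangle, canonical set Z)."""
    n = len(data)
    sizes = (1, 2, 3, 4)
    weights = [comb(n, j) for j in sizes]       # Prob[i = j] = C(n, j) / t
    while True:                                 # each iteration is one trial
        i = rng.choices(sizes, weights=weights)[0]
        z = frozenset(rng.sample(data, i))      # uniform among i-point subsets
        rect = mbr(z)
        if canonical_set(data, rect) == z:      # successful trial
            return rect, z

if __name__ == "__main__":
    data = [(0, 2), (2, 4), (4, 1), (1, 0), (2, 2)]
    print(random_query_rectangle(data, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible; by default the module-level generator is used.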



Proposition 1. On any input data set I, the Las Vegas algorithm in Figure 4
outputs a random query rectangle.



On the Generation of 2-Dimensional Index Workloads 






Proof. We consider the query rectangle output by the algorithm to be the rect- 
angle represented by the set Z output by the algorithm. 

Consider a single trial. If the set Z generated during the trial is a 1-point 
set or a 2-point set, then Z must be the canonical set of points representing the 
MBR containing Z. If Z is a 3-point set or a 4-point set, then Z is output iff it 
is the canonical set of points representing the MBR R containing Z. Thus Z is 
output iff it is the canonical set of points for the query rectangle it represents. 

We need to show that the algorithm outputs each possible query rectangle
with equal probability. Let R be a j-point query rectangle for some j between 1
and 4. Let Q be the canonical set of points for R. In any trial, the set Z will be
equal to Q iff i is chosen to be equal to j in line 1, and Z is chosen to be equal
to Q in line 2. Thus the probability that, in a given trial, Z is the canonical set
of points for R is

    $\frac{\binom{n}{j}}{t} \cdot \frac{1}{\binom{n}{j}} = \frac{1}{t}$

The probability that a trial is successful is therefore $r/t$, where r is the number
of distinct query rectangles R. The conditional probability of generating a
particular query rectangle R in a trial, given that the trial is successful, is
$(1/t)/(r/t) = 1/r$. Since the algorithm outputs a query rectangle as soon as it has
a successful trial, each distinct query rectangle R is output by the algorithm with
the same probability of $1/r$. □



Proposition 2. On any input data set I consisting of n points, the expected 
number of trials of the Las Vegas algorithm in Figure 4 is $t/r$, where r is the 
number of distinct query rectangles represented by subsets of I, and 
$t = \binom{n}{1} + \binom{n}{2} + \binom{n}{3} + \binom{n}{4}$. 

Proof. This follows immediately from the proof of Proposition 1, in which it is 
shown that the probability that a given trial is successful is $r/t$. □ 
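
The step from the success probability to the expected number of trials is the mean of a geometric distribution. A short derivation, writing p for the per-trial success probability (the symbol p is introduced here only for convenience):

```latex
\mathbb{E}[\text{number of trials}]
  = \sum_{k \ge 1} k \,(1-p)^{k-1}\, p
  = \frac{1}{p}
  = \frac{t}{r},
\qquad \text{where } p = \frac{r}{t}.
```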

2.4 Expected Performance over Worst-Case Data Sets 

We now do a worst-case analysis of the expected running time of the algorithm. 
Our analysis will assume that the actual data are stored in the leaves of the 
index. When this is not the case (i.e., for “secondary” indexes), an additional 
random I/O is required for each item in the output set. 

Proposition 3. The expected number of trials of the Las Vegas algorithm of 
Figure 4 (on the worst-case input) is $\Theta(n^2)$. 

Proof. By Proposition 2, the expected number of trials is $t/r$. This quantity 
depends on r, which can vary according to the configuration of the points in the 
data set. We now bound r. On any n-point data set, the number of 1-point query 
rectangles is n and the number of 2-point query rectangles is $\binom{n}{2}$. 
Therefore, on any data set, r is at least $n + \binom{n}{2}$. This bound is achieved 
by data sets in which all points fall on a line, because such data sets define no 
3-point and 4-point rectangles. Since $t = \Theta(n^4)$, the proposition follows. □ 
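
The worst case can be checked by brute force on a small collinear data set. The sketch below is illustrative only: the function names are hypothetical, and it assumes general position (distinct x and distinct y coordinates, which holds for points on a diagonal), so that a set is canonical exactly when all of its points lie on the boundary of its MBR.

```python
from itertools import combinations
from math import comb


def all_on_boundary(points):
    """True iff every point lies on the boundary of the set's MBR."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return all(x in (x0, x1) or y in (y0, y1) for x, y in points)


def count_query_rectangles(data):
    """Count r by enumerating all subsets of size 1..4 (general position assumed)."""
    r = 0
    for j in (1, 2, 3, 4):
        for z in combinations(data, j):
            if j <= 2 or all_on_boundary(z):
                r += 1
    return r


def expected_trials(data):
    """Return t / r for the given data set."""
    n = len(data)
    t = sum(comb(n, j) for j in (1, 2, 3, 4))
    return t / count_query_rectangles(data)


# Points on a diagonal line: only 1-point and 2-point rectangles exist.
diagonal = [(k, k) for k in range(10)]
print(count_query_rectangles(diagonal), expected_trials(diagonal))
```

For ten diagonal points, r = 10 + 45 = 55 and t = 385, so the expected number of trials is 7; doubling n to 20 gives 6195 / 210 = 29.5, consistent with quadratic growth.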

Proposition 4. There is an implementation of the Las Vegas algorithm of 
Figure 4 that runs in expected time $\Theta(n^2)$ (on the worst-case input) with 
$\Theta(n \log n)$ preprocessing. 

Proof. Consider the following implementation of the algorithm. It requires that 
the data set be preprocessed. The preprocessing can be done as follows. Form 
two sorted arrays, the first containing the data set sorted by x coordinate (with 
ties broken by sorting on the y coordinate), and the second containing the data 
set sorted by y coordinate (with ties broken on the x coordinate) . With each data 
point p in the first array, store a link to the corresponding point in the second 
array. This preprocessing can easily be implemented to run in time $O(n \log n)$. 

The implementation of a trial is as follows. Generate the points Z using the 
first sorted array and calculate the MBR R containing them. If Z contains 1 or 
2 points, then output Z and end the trial. 

Otherwise, begin checking whether Z is the canonical set of points for R by 
first checking whether all points in Z lie on the boundary of R and whether each 
boundary of R contains exactly one point of Z. If not, end the trial without 
outputting Z. 

If Z passes the above tests, it is sufficient to check, for each boundary of R, 
whether Z contains the canonical point on that boundary. To check a boundary 
of R, find the point p in Z lying on that boundary. Then access the point p' 
immediately preceding p in the first array (for the left and right boundaries) or 
immediately following p in the second array (for the top and bottom boundaries) . 
Because of the links from p in the first to p in the second array, p' can be accessed 
in constant time for each of the boundaries. Check whether p' lies on the same 
boundary of R as p. If not, p is the canonical point on the relevant boundary 
of R; otherwise it is not. If Z contains the canonical point for each boundary 
of R, then output Z. Otherwise, end the trial without outputting Z. 

Since each of the steps in the above implementation of a trial takes constant 
time, each trial takes constant time. By the previous proposition, the expected 
number of trials is $\Theta(n^2)$, and thus the expected running time is $\Theta(n^2)$. □ 
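
The canonical-set test described in the proof can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, Python dictionaries stand in for the stored links between the two arrays, and the tie-breaking convention (the neighbor checked on each boundary) is taken directly from the wording of the proof above.

```python
def preprocess(data):
    """Build the two sorted arrays and position maps (playing the role of links)."""
    by_x = sorted(data)                              # by x, ties broken on y
    by_y = sorted(data, key=lambda p: (p[1], p[0]))  # by y, ties broken on x
    pos_x = {p: k for k, p in enumerate(by_x)}
    pos_y = {p: k for k, p in enumerate(by_y)}
    return by_x, by_y, pos_x, pos_y


def is_canonical(z, index):
    """Constant-time check for a set z of at most 4 distinct data points."""
    by_x, by_y, pos_x, pos_y = index
    if len(z) <= 2:
        return True  # 1-point and 2-point sets are output directly
    x0, x1 = min(p[0] for p in z), max(p[0] for p in z)
    y0, y1 = min(p[1] for p in z), max(p[1] for p in z)
    left = [p for p in z if p[0] == x0]
    right = [p for p in z if p[0] == x1]
    bottom = [p for p in z if p[1] == y0]
    top = [p for p in z if p[1] == y1]
    # Each boundary must hold exactly one point of z ...
    if any(len(side) != 1 for side in (left, right, bottom, top)):
        return False
    # ... and every point of z must lie on some boundary.
    if len(set(left + right + bottom + top)) != len(z):
        return False
    for (p,) in (left, right):
        k = pos_x[p]
        if k > 0:
            q = by_x[k - 1]  # point immediately preceding p in the first array
            if q[0] == p[0] and y0 <= q[1] <= y1:
                return False  # q lies on the same boundary, so p is not canonical
    for (p,) in (bottom, top):
        k = pos_y[p]
        if k + 1 < len(by_y):
            q = by_y[k + 1]  # point immediately following p in the second array
            if q[1] == p[1] and x0 <= q[0] <= x1:
                return False
    return True
```

With the data set `[(0, 1), (0, 2), (1, 0), (3, 1.5), (2, 4)]`, the sets `{(0, 1), (1, 0), (3, 1.5), (2, 4)}` and `{(0, 2), (1, 0), (3, 1.5), (2, 4)}` have the same MBR, and the check accepts only the first.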

Fortunately, the expected number of trials is significantly lower than $\Theta(n^2)$ 
for many data sets. For example, consider the “plus” data set, where the data 
points lie on two line segments, a horizontal and a vertical, intersecting each 
other in the middle. Thus, the data set is divided into four partitions (up, down, 
left, right) of approximately the same size. For this data set it can be shown 
that a set Z generated in Step 2(a) of the above algorithm is a canonical set 
with significant probability. In particular, if Z is a 4-point set, the probability 
that Z is a canonical set is equal to the probability that the four points belong 
to different partitions. Hence: 

$$P(Z \text{ is a 4-pt canonical set}) = 1 \cdot \frac{3}{4} \cdot \frac{2}{4} \cdot \frac{1}{4} = \frac{3}{32} > \frac{1}{11}.$$

Since for the other query types (e.g., 1-, 2-, or 3-point) the above probability is 
even higher, it follows that the expected number of trials per query rectangle for 
this data set is at most 11. 
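
The arithmetic behind the bound of 11 can be reproduced with exact fractions. This is a small check under the same idealization as the text (four partitions of equal size, points drawn independently); the variable names are illustrative.

```python
from fractions import Fraction

# Probability that four points fall into four different partitions:
# the first point is free, each later point must avoid the occupied partitions.
p4 = Fraction(1) * Fraction(3, 4) * Fraction(2, 4) * Fraction(1, 4)

# A success probability of at least p4 per trial gives at most 1 / p4 expected trials.
trial_bound = 1 / p4

print(p4, float(trial_bound))
```

The success probability is 3/32, so the expected number of trials is at most 32/3, which is just under 11.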

In the following section, we prove that on a uniform data set, the expected 
number of trials before the algorithm outputs a query rectangle is less than 6, 
independent of the number of points in the data set. In Section 3 we present 
empirical results on both artificial and real data sets, showing that the average 
number of trials is similarly small. 

2.5 Algorithm Analysis: Uniform Data Sets 

Definition 5. A uniform data set is a set of points drawn independently from 
the uniform distribution on the points in the unit square. 



[Figure: a query rectangle, with axis marks y3, y4, y1, y2] 

Fig. 5. Query Rectangle 



Proposition 5. Let I be a uniform dataset of size n. The expected number of 
trials of the Las Vegas algorithm in Figure 4 on data set I is less than 6. As n 
increases, the expected number of trials approaches 6. (The expectation is over 
the choice of points in I.) 

Proof. Let $\{p_1, p_2, p_3, p_4\}$ be a set of four distinct random points chosen 
from I, where $p_i = (x_i, y_i)$. Since the points in I are chosen from the uniform 
distribution on the unit square, the probability that any two have a common x or 
y coordinate is 0. Therefore, we assume without loss of generality that all four 
points have distinct x coordinates and all have distinct y coordinates.^ 

Let $S_4$ be the set of permutations of the 4 elements $x_1, x_2, x_3, x_4$. We 
associate with the permutation $\sigma = (x_i, x_j, x_k, x_l)$ the event 
$A_\sigma = \{x_i < x_j < x_k < x_l\}$. Clearly the $A_\sigma$'s partition the sample 
space, so if B is the event that $\{p_1, \ldots, p_4\}$ is the canonical set of points 
of a query rectangle, then by Bayes’ formula 

$$P(B) = \sum_{\sigma \in S_4} P(B \mid A_\sigma)\, P(A_\sigma).$$

^ In practice, the probability will not be zero because of finite precision in the 
representation of points on the unit square, but it will usually be small enough so 
that its effect on the expected number of trials will be negligible. 

By symmetry, $P(A_\sigma) = 1/4!$ for every $\sigma \in S_4$, and $P(B \mid A_\sigma)$ is the 
same for any $\sigma \in S_4$, so $P(B)$ is actually $P(B \mid A_\sigma)$ where $\sigma$ is any 
permutation. Taking in particular $\sigma_0 = (x_1, x_2, x_3, x_4)$ (see Figure 5), then 

$$\begin{aligned}
P(B \mid A_{\sigma_0}) &= P(y_1, y_4 \text{ between } y_2, y_3) \\
&= 2\, P(y_2 < y_1, y_4 < y_3) \quad \text{(by symmetry)} \\
&= 2 \int_0^1 dy_3 \int_0^{y_3} dy_2 \int_{y_2}^{y_3} dy_1 \int_{y_2}^{y_3} dy_4 \\
&= 2 \int_0^1 dy_3 \int_0^{y_3} dy_2 \,(y_3 - y_2)^2 \\
&= \frac{1}{6}.
\end{aligned}$$

A similar argument shows that three distinct randomly selected points from I 
have probability 2/3 of being a canonical set. One or two randomly selected points 
have probability 1 of being a canonical set. 

Let $I_j$ denote the set of all subsets of size j of the data set I, for j = 1, 2, 3, 4. 
For all sets S in $I_j$, let $\chi_S = 1$ if the points in S are the canonical points for 
the rectangle they represent, and $\chi_S = 0$ otherwise. The expected number r of 
query rectangles in I can be bounded as follows: 

$$\begin{aligned}
E[\text{Number of Query Rectangles}]
&= \sum_{j=1}^{4} E[\text{Number of } j\text{-point Query Rectangles}] \\
&= \sum_{j=1}^{4} E\Big[\sum_{S \in I_j} \chi_S\Big] \\
&= \sum_{j=1}^{4} \sum_{S \in I_j} E[\chi_S] \\
&> \sum_{j=1}^{4} \sum_{S \in I_j} \frac{1}{6} \\
&= \frac{1}{6} \sum_{j=1}^{4} \binom{n}{j}.
\end{aligned}$$

Since the expected number of trials of the Las Vegas algorithm is $t/r$, where 
$t = \sum_{j=1}^{4} \binom{n}{j}$, the expected number of trials on data set I is less 
than 6. As n increases, the expected number of trials approaches 6, because for 
large n, $\binom{n}{4} \gg \binom{n}{j}$ for j = 1, 2, 3. □ 
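
The probabilities used in this proof can be verified by exhaustive enumeration, since for independent uniform points every relative ordering of the coordinates is equally likely. The following sketch (with a hypothetical function name) fixes the x order and enumerates all y orders, counting those in which every point lies on the boundary of the MBR.

```python
from fractions import Fraction
from itertools import permutations


def canonical_fraction(k):
    """Fraction of coordinate orderings in which k points form a canonical set.

    Points are represented by ranks: the point with x rank a has y rank ys[a].
    With distinct coordinates, the set is canonical iff every point has an
    extreme x rank or an extreme y rank.
    """
    good = 0
    total = 0
    for ys in permutations(range(k)):
        total += 1
        if all(x in (0, k - 1) or y in (0, k - 1) for x, y in enumerate(ys)):
            good += 1
    return Fraction(good, total)


for k in (1, 2, 3, 4):
    print(k, canonical_fraction(k))
```

The enumeration gives 1, 1, 2/3, and 1/6 for sets of 1, 2, 3, and 4 points respectively, matching the values derived above.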



3 Empirical Performance 

In order to validate the efficiency of the Las Vegas algorithm in practice, we 
conducted several experiments over standard synthetic and geographic data sets. 
We also used a technique called Levy flights to generate sets of points with 
fractal dimension similar to those found in real-life data [?]. All data sets were 
normalized to the unit square. In particular, we used the following 2-dimensional 
data sets: 

— Uniform: In this data set 100000 points are uniformly distributed in the unit 
square. 

— Double Cluster: This data set contains two clusters of approximately 50000 
points each, one centered near the origin, and the other close to the point 
(1, 1). The points in each cluster follow a 2-dimensional independent 
Gaussian distribution with $\sigma^2 = 0.1$. 

— LBCounty: This data set is part of the TIGER database [?] and contains 
53145 road intersections from Long Beach County, CA. 

— MGCounty: The same as above from Montgomery County, MD, containing 
39231 points. 

— Levy 1.46: This is a data set of 100000 points generated by the Levy flight 
generator, with fractal dimension approximately 1.46. 

— Levy 1.68: The fractal dimension here was close to 1.68 with the same number 
of points as above. 

                  Algorithm Performance            Query Properties (average over 1000 queries) 
Data Set          Avg. # Trials    Seconds for     Mean       Mean          Mean 
                  per Query        1000 Queries    Area       Cardinality   Aspect Ratio 
                  Rectangle 
Uniform            6.033           3.61            0.36475    35.67%        1.170289 
Double Cluster     9.530           4.04            0.3015     39.28%        1.124432 
LBCounty           8.640           3.25            0.17285    36.62%        1.138304 
MGCounty           6.950           2.95            0.1351     37.19%        1.234010 
Levy 1.46         12.92            4.64            0.06955    33.60%        4.993343 
Levy 1.68          8.741           3.87            0.08729    35.11%        0.360295 

Table 1. Performance of the Randomized Algorithm, and Properties of the 
Random Range Queries 

We used a main memory implementation of the algorithm to generate 1000 
random range queries for each data set, and the results are shown in Table 1. The 
cost of the algorithm is expressed both as the number of trials per random range 
query, and the elapsed time. Recall that all datasets are the same size except 
for the two real datasets. For the Uniform data set the cost was very close to 
6 trials, as anticipated. For the Double Cluster data set the cost was a little 
higher but remained small. The algorithm performed very well for the real-life 
data sets even though these data sets are clearly non-uniform. Finally, for the 
Levy Flights data sets, the one with fractal dimension 1.46 had the higher cost.^ 
The results of the experiments indicate that our randomized algorithm is quite 
efficient in practice, and robust across a variety of distributions. 

^ This matches the intuition behind the use of fractal dimension in [?]; as discussed 
in Section 2.4, data laid out on a line (fractal dimension 1) results in more trials 
than data laid out uniformly in 2-space (fractal dimension 2), hence one should 
expect worse performance with lower fractal dimension. Unfortunately, in further 
experiments we found counter-examples to this trend; additional study of fractal 
dimension and query generation seems interesting, but is beyond the scope of this 
paper. 

4 Properties of Random Range Queries 



In this section we consider various properties of the random range queries gener- 
ated by our algorithm; in particular we consider the area, cardinality and aspect 
ratio of the query rectangles. Note that, given a data set, the areas of the query 
rectangles defined by that data set may not be uniformly distributed. In fact, as 
we show here, the contrary is often true. 

We first prove a result concerning uniform datasets. 

Proposition 6. Let I be a uniform dataset of size n > 4. Let $z = (p_1, \ldots, p_4)$ 
be four distinct random points chosen from I. The expected area of the rectangle 
defined by z, given that z is the canonical set of points of a rectangle, is 
$\eta = 0.36$, and the variance is $\sigma^2 = 0.03$. 



Proof. Let B be the event that z is the canonical set of points for a query 
rectangle, and let $A_\sigma$ be the event associated with the permutation $\sigma$ as above 
(Proposition 5). Then for $\sigma_0 = (x_1, x_2, x_3, x_4)$ the area of the generated 
rectangle is $g(z) = |x_4 - x_1|\,|y_3 - y_2|$. Hence, the expected area is: 

$$\begin{aligned}
E\{g(z) \mid B\} &= \frac{1}{P(B)} \int_B g(z)\,dz = 6 \int_B g(z)\,dz \\
&= 6 \sum_{\sigma \in S_4} \int_{B \cap A_\sigma} g(z)\,dz \\
&= 6 \times 4! \times \int_{B \cap A_{\sigma_0}} g(z)\,dz \\
&= 6 \times 4! \times 2 \times
   \int_0^1 dx_4 \int_0^{x_4} dx_1 \int_{x_1}^{x_4} dx_2 \int_{x_2}^{x_4} dx_3 \,(x_4 - x_1) \\
&\qquad \times \int_0^1 dy_3 \int_0^{y_3} dy_2 \int_{y_2}^{y_3} dy_1 \int_{y_2}^{y_3} dy_4 \,(y_3 - y_2) \\
&= \frac{36}{100}.
\end{aligned}$$

Similarly, the variance is computed to be $\sigma^2 = 0.03$. 
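
The value 0.36 can be sanity-checked by simulation. The sketch below is illustrative (the function name and the sample size are arbitrary choices, not from the paper): it draws four uniform points at a time, keeps only the draws in which all four points lie on the boundary of their MBR, and averages the MBR area over the kept draws.

```python
import random


def estimate_area(samples, rng):
    """Return (mean area of accepted 4-point rectangles, acceptance rate)."""
    total_area = 0.0
    accepted = 0
    for _ in range(samples):
        pts = [(rng.random(), rng.random()) for _ in range(4)]
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        # With distinct coordinates, the set is canonical iff every point
        # lies on the boundary of the MBR.
        if all(x in (x0, x1) or y in (y0, y1) for x, y in pts):
            accepted += 1
            total_area += (x1 - x0) * (y1 - y0)
    return total_area / accepted, accepted / samples


mean_area, acceptance = estimate_area(100_000, random.Random(1))
print(round(mean_area, 3), round(acceptance, 3))
```

The estimated mean area should be close to 0.36 and the acceptance rate close to 1/6, up to sampling noise.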



If we run our algorithm on a uniform dataset of any reasonably large size, 
then with high probability, the rectangle returned by the algorithm will be a 4- 
point query rectangle (because in each trial, the probability that the algorithm 
generates a 4-point set is very high, and on a uniform dataset, the probability 
that a 4-point trial is successful is 1/6). Thus, by Proposition 6, the expected area 
of a rectangle generated by running our algorithm on a reasonably large uniform 
dataset is approximately .36, with variance approximately .03. A similar analysis 
shows that the expected aspect ratio of the generated rectangle (defined as the 
ratio of x-side over y-side) will be approximately 1.2 with variance approximately 
0.96. 

In light of the above results, we were interested in the spatial properties 
(namely area and aspect ratio) for all of our data sets. Table 1 shows the mean 
values of the three spatial properties in experiments of 1000 random range queries 
over each of the various data sets. Figure 6 shows the frequency graphs of the 
areas of the queries for each data set. From Figure 6 it is clear that the 
distribution of the area of the generated random range queries is highly dependent 
on the underlying data set. The Uniform data set gives a variety of large-area range 
queries, whereas for the fractal data sets most of the query rectangles are smaller 
than 10% of the total area. In Figure 7 we present the frequencies of the aspect 
ratio (long-side/short-side) for the same set of range queries, and in Figure 8 we 
present the frequencies of cardinalities. Note that cardinality seems to be less 
sensitive to data distribution than the spatial properties. 

Fig. 6. Frequencies of query-area for various distributions. 

Fig. 7. Frequencies of query-aspect-ratio for various distributions. 

One conclusion of this analysis is that a domain-independent “logical” query 
generator may have unusual properties, and those properties may vary widely in 
data-dependent ways. This is a cautionary lesson about generating test queries 
in a domain-independent fashion, which undoubtedly applies to the generation 
of queries over indexes in domains other than spatial range queries. A natural 
alternative is to consider using domain-specific query generators when possi- 
ble. To explore this idea further, we proceed to examine domain-specific spatial 
query generation techniques from the literature, and demonstrate that they raise 
complementary and potentially problematic issues of their own. 

5 Alternative Techniques for Query Generation 

As remarked above, other authors have considered domain-dependent query 
distributions. The most extensive exploration of such distributions is due to Pagel 
et al. [?]. They considered the expected cost of 2-d range queries over 
distributions other than the logical distribution; in particular they proposed four 
specific classes of distributions. 

Fig. 8. Frequencies of query-cardinality for various distributions. 

The first class, called $QM_1$, consists of distributions that are uniform over 
the space of all squares of area k (for some fixed k) whose centers fall “within 
the data space”, i.e., within the spatial boundaries of the existing data set. 
This is a natural scheme with clear domain-specific properties. However, using 
a distribution from this class to test the performance of an index has notable 
drawbacks. The workloads generated from this distribution are very sensitive to 
the data distribution, and may not be statistically sound in covering the possible 
behaviors of the index. For example, if the area k is a relatively small fraction 
of the data space (as is common in many previous studies), and if the data is 
clustered in a small region of the data space, then random squares chosen from 
$QM_1$ are likely to be empty. Repeatedly testing an index on empty queries may 
not yield the kind of information desired in experiments. 
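
The empty-query effect is easy to reproduce. The following sketch is an illustration under stated assumptions, not Pagel et al.'s generator: the function names are hypothetical, and “the data space” is taken to be the bounding box of the data set.

```python
import random


def qm1_square(data, k, rng):
    """Square of area k whose center is uniform over the data's bounding box."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    half = (k ** 0.5) / 2
    cx = rng.uniform(min(xs), max(xs))
    cy = rng.uniform(min(ys), max(ys))
    return (cx - half, cy - half, cx + half, cy + half)


def cardinality(data, rect):
    """Number of data points inside the (closed) rectangle."""
    x0, y0, x1, y1 = rect
    return sum(1 for x, y in data if x0 <= x <= x1 and y0 <= y <= y1)


def empty_fraction(data, k, queries, rng):
    """Fraction of generated squares that contain no data point."""
    empty = sum(1 for _ in range(queries)
                if cardinality(data, qm1_square(data, k, rng)) == 0)
    return empty / queries


rng = random.Random(0)
# A tight cluster near the origin plus one far point that stretches the data space.
clustered = [(rng.uniform(0, 0.01), rng.uniform(0, 0.01)) for _ in range(1000)]
clustered.append((1.0, 1.0))
print(empty_fraction(clustered, k=0.0001, queries=1000, rng=rng))
```

On such a clustered data set nearly every generated square is empty, which is the drawback described above.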

The other classes of distributions proposed by Pagel et al. attempt to 
overcome such drawbacks. Each of these classes is defined in terms of a density 
function D on the data space. One, called $QM_2$, is like $QM_1$ in that it 
consists of distributions over the space of squares of area k, for some fixed k, 
whose centers are within the data space. In $QM_2$ distributions, though, the 
probability of each square is weighted according to the value of D at the square’s 
center. The remaining two classes of distributions, $QM_3$ and $QM_4$, are over the 
space of squares that enclose a fixed fraction s of the total probability mass (as 
defined by D), for some s, whose centers are within the data space. Distributions 
in $QM_3$ are uniform over such squares, and distributions in $QM_4$ weight each 
square according to the value of D at its center. 

Pagel et al. point out that the expected cost of performing a random 
rectangular query in a particular LSD-tree (or R-tree, or similar data structure) is 
equal to the sum, over all leaves in the tree, of the probability that a random 
rectangle intersects the rectangular region defined by the points stored in that 
leaf [?]. They use this fact to compute analytically, for particular trees, the 
expected cost of a random rectangular query drawn from a distribution in $QM_1$. 
In contrast, to compute the expected cost of a rectangular query drawn from the 
logical distribution on rectangular queries, we would use an empirical approach; 
we would generate random queries, determine the cost of each, and take an 
average. The logical distribution is a difficult one to work with analytically; unlike 
distributions in $QM_1$, it is a discrete distribution defined by a data set. 
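
The sum over leaves described above can be sketched for the simplest case. The code below is an illustration under explicit assumptions that are not taken from Pagel et al.: the data space is the unit square, query squares have a fixed side length with centers uniform over the unit square, leaf regions are given as (xmin, ymin, xmax, ymax) inside the unit square, and the function names are hypothetical. Under these assumptions a square meets a leaf region exactly when its center falls in the region inflated by half the side length, clipped to the data space.

```python
def hit_probability(leaf, side):
    """Probability that a random square of the given side intersects the leaf region."""
    x0, y0, x1, y1 = leaf
    half = side / 2
    # Inflate the leaf by half the side and clip to the unit square.
    width = min(1.0, x1 + half) - max(0.0, x0 - half)
    height = min(1.0, y1 + half) - max(0.0, y0 - half)
    return max(0.0, width) * max(0.0, height)


def expected_leaf_accesses(leaves, side):
    """Expected number of leaves touched by one random square query."""
    return sum(hit_probability(leaf, side) for leaf in leaves)


# Two leaves splitting the unit square in half; squares of side 0.1.
halves = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
print(expected_leaf_accesses(halves, 0.1))
```

In this example each leaf is hit with probability 0.55, so a query touches 1.1 leaves on average; queries whose square straddles the split line account for the extra 0.1.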

In later experimental work [18,16], Pagel et al. use only distributions in $QM_1$. 
As they point out, exactly computing expected query cost with respect to 
distributions in $QM_3$ and $QM_4$ can be difficult [?]. In addition, distributions in 
$QM_2$, $QM_3$, and $QM_4$ are defined in terms of a distribution D on the data 
space. Since a real data set does not directly provide a distribution D on the 
data space, the last three query distributions are not well-defined “in the field,” 
that is, for real data sets (although in some cases it may be feasible to model 
the distribution D). 

In short, the Pagel query distributions present a mirror image of our logi- 
cal query distribution. The Pagel distributions’ domain-specific properties are 
easily controlled, and they are amenable to analytic analyses - they appeal to 
the intuitive properties of 2-space. By contrast, the logical properties of these 
distributions (e.g. query cardinality) are sensitive to data-distribution and often 
wildly skewed, and some of the available techniques are inapplicable to real data 
sets. The domain-specific and logical distributions have very different strengths 
and weaknesses for studying indexes, and we elaborate on this in the next sec- 
tion. 



6 On Randomization, Benchmarks, and Performance 
Studies 

One advantage of domain-dependent distributions like $QM_1$ is that they attempt 
to model a class of user behavior. This is the goal of so-called Domain-Specific 
Benchmarking [7], and is clearly a good idea: one main motivation for 
benchmarking is to analyze how well a technique works overall for an important class 
of users. In the case of spatial queries it makes sense to choose square queries of 
small area, for example, since users with graphical user interfaces are likely to 
“drag” such squares with their mouse. This is natural when trying to get detail 
about a small portion of a larger space shown on screen - e.g., given a map of 
the world one might drag a square around Wisconsin. 

But random workloads can also be used to learn more about a technique - 
in particular to “stress test” it. To do this, one wants to provide a diversity of 
inputs to the technique, in order to identify the inputs it handles gracefully, and 
those it handles poorly. For indexes, a workload is described logically as a set 
of queries (subsets of the data), and the indexing problem can be set up as a 
logical optimization problem: an indexing scheme [?] should cluster items into 
fixed-size buckets to optimize some metric on bucket-fetches over the workload 
(e.g. minimize the total number of bucket fetches for a set of queries run in 
sequence). It is by no means clear that domain-specific considerations help set 
up these stress tests - logical workloads seem more natural for this task. 

One can easily quibble with both of these characterizations, however. Clearly 
the (domain-dependent) $QM_1$ workloads do a bad job of modeling user behavior 
when they generate many empty queries - in essence, they ignore the fact that 
users often do care about logical properties like cardinality. Conversely, the stress 
imposed by the logical workload on a 2-d index is questionable if most of the 
queries are actually large and squarish, as in Table 1: evidence suggests that long, 
skinny queries are the real nemesis of spatial indexes like R-trees, which usually 
try to cluster points into squarish regions [17,12]. 

What then can one conclude about the pros and cons of logical and domain- 
specific query generation? In short, that regardless of the technique an experi- 
mentalist uses to generate queries, she needs to understand and be able to control 
the distribution of a workload over domain-specific and logical properties of rel- 
evance. Identifying these properties is not a simple problem, and understanding 
how to generate queries that are well distributed over them is a further challenge. 
One strong conclusion of our work is that this process has been little explored 
in index experiments to date, and is in fact fairly complex. 

Two-dimensional range queries are probably the best-understood, non-trivial 
indexing challenge. Thus a natural direction for future work is to attempt to 
merge the insights from the previous sections to develop a malleable, well- 
understood toolkit for experimenting with 2-d indexes. We conclude by exploring 
some ideas in this direction, and raising questions for generating queries in other 
domains. 



7 Conclusion and Future Directions 

In this paper we highlight a new approach to generating random queries for 
index experimentation, which uses a logical distribution on queries. We present 
an algorithm to generate queries from this distribution, showing that it has good 
expected performance for some distributions, and good measured performance 
over a variety of real and synthetic data. A remaining open problem is to devise 
an algorithm for this task with good guaranteed time- and space-efficiency over 
all point distributions. 

The very different properties of this distribution and previously proposed 
distributions suggest a new line of research: developing techniques to allow ex- 
perimentalists to easily understand and control various properties of their work- 
load. A direct attack on this problem is to first map desired domain-dependent 
properties into a distribution over the space of logical queries, and then devise 
an efficient algorithm for choosing from that distribution. In general this seems 
quite difficult, but the original problem tackled in this paper is a simple instance 
of this approach: it specifies a distribution over all queries P(D), with 100% 
of the distribution falling into the (domain-specific) category of query rectan- 
gles. Perhaps this direct approach will prove tractable for other simple spatial 
distributions as well. 

® In fact our interest in the subject of random 2-d range queries was sparked by an 
attempt to experimentally validate this hypothesis for a variety of spatial indexes! 

In addition to this direct theoretical challenge, a variety of potentially useful 
heuristics suggest themselves as well. Rather than map spatial properties to a 
distribution over the logical queries, one can simply partition the data set on 
spatial properties, and use the uniform logical distribution over the partitions. 
For example, to generate a distribution with smaller average query area, one can 
tile the data space and run our Las Vegas algorithm over data partitions that 
correspond to tiles. This seems rather ad hoc, but is perhaps easier to reason 
about logically than the totally domain-specific techniques of Pagel, et al. It is 
also applicable to extant “real-world” data sets. 
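
The tiling heuristic amounts to partitioning the points by grid cell before running the generator on one partition. A minimal sketch of the partitioning step, assuming data normalized to the unit square and a square grid (the function name and the grid layout are illustrative choices; how a tile is selected for each query is left open here, as it is in the text):

```python
from collections import defaultdict


def tile_partitions(data, tiles_per_side):
    """Group points of the unit square into a tiles_per_side x tiles_per_side grid.

    Returns a dict mapping (row, col) to the list of points in that tile.
    Points with a coordinate equal to 1.0 are assigned to the last tile.
    """
    parts = defaultdict(list)
    for x, y in data:
        col = min(int(x * tiles_per_side), tiles_per_side - 1)
        row = min(int(y * tiles_per_side), tiles_per_side - 1)
        parts[(row, col)].append((x, y))
    return dict(parts)


points = [(0.1, 0.1), (0.9, 0.9), (1.0, 0.0), (0.4, 0.6)]
print(tile_partitions(points, 2))
```

Because every query generated from a single partition is bounded by points of one tile, its area cannot exceed the tile area, which is how tiling pushes the average query area down.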

Another heuristic is to introduce new points into the query generator’s data 
set in order to achieve biases on spatial properties. For example, to increase 
the variance in aspect ratio of a uniformly distributed point set, one can insert 
clusters of points along the extremes of one axis or the other. One can then 
run the resulting queries over the original data set. This may allow for a better 
control over the resulting query mix than a domain-specific spatial technique. 
The goal of future research should be the creation of a domain-independent 
framework for index benchmarking and testing. 

Abstract. The vast number of design options in replicated databases requires 
efficient analytical performance evaluations so that the considerable overhead 
of simulations or measurements can be focused on a few promising options. A 
review of existing analytical models in terms of their modeling assumptions, 
replication schemata considered, and network properties captured, shows that 
data replication and intersite communication as well as workload patterns 
should be modeled more accurately. Based on this analysis, we define a new 
modeling approach named 2RC (2-dimensional replication model with 
integrated communication). We derive a complete analytical queueing model 
for 2RC and demonstrate that it is of higher expressiveness than existing 
models. 2RC also yields a novel bottleneck analysis and permits evaluation of the 
trade-off between throughput and availability. 



1 Introduction 

Replication management in distributed databases concerns the decision when and 
where to allocate physical copies of logical data fragments (replica placement), and 
when and how to update them to maintain an acceptable degree of mutual consistency 
(replica control). Replication intends to increase data availability in the presence of 
site or communication failures, and to decrease retrieval costs by local access if 
possible. The maintenance of replicated data is therefore closely related to intersite 
communication, and replication management has significant impact on the overall 
system performance. 

The literature offers several algorithms for replica placement [32,22] as well as for 
replica control [8,9,15]. However, the sophistication of such methods is not matched 
by today’s analytical models for performance evaluation. Considering the vast 
number of alternatives in the design of a replication schema, it becomes apparent that 
existing analytical models only consider very extreme replication schemata, e.g. no 
replication or full replication. Furthermore, the important role of intersite 
communication in replica management is not sufficiently taken into account. While 
the evolution and symbiosis of distributed database and modern communication 
systems is progressing, theoretical performance models to evaluate such systems lag 
behind. Our aim is to identify and remedy such flaws in the performance evaluations 
and to derive an analytical modeling approach that describes real-world replicated 
databases more accurately so that new and more expressive results are obtained. 

In section 2 we present a critical review of existing performance models including 
a formal classification of how replication can be modeled. Section 3 presents the 
requirements for a new comprehensive performance model (2RC) to overcome the 
deficiencies of existing approaches. A new 2-dimensional model of replication and 
the dependency structure captured in 2RC is described. Section 4 develops an 
analytical queueing model that implements 2RC. It is based on the 2D-replication 
model and a detailed communication model. Section 5 presents results derived from 
our model, including a bottleneck analysis which is the critical part of any analytical 
throughput estimation [33]. 



2 Shortcomings in Existing Performance Models 

In this section we analyze alternatives in performance modeling of distributed 
databases which reveals drawbacks in existing studies and motivates 2RC. 



2.1 General Modeling Concepts and Communication 

Performance studies of distributed databases employed analytical methods [20,26,27, 
16,13,2], as well as simulations [3,17,25,31,18,29,10,7]. Simulations can evaluate 
complex system models whose level of detail precludes analytical solutions. 
However, simulations are costly in terms of programming and computing time. Thus, 
simulations often fail to cover the parameter space and to carry out a sensitivity 
analysis as thoroughly as desired. 

Simulation and analytical studies often use queueing systems as the underlying 
models. Early queueing models of distributed databases model a fully replicated 
database of m local sites by a M/M/m/FCFS system with blocking [4,11,24]. Read 
transactions are processed by the m servers in parallel, while write transactions block 
all m servers. This models shared reads and exclusive writes. Major drawbacks of 
these models are that intersite communication is neglected and all sites share a single 
queue of incoming transactions. 

To remedy these flaws, distributed databases can be modeled by queueing 
networks. [10, 23] use networks of M/M/1 queues, i.e. each local database is modeled 
as an M/M/1 system. However, this still restricts all transactions to have the same 
exponentially distributed service times. More generally, [5,16] model the local 
databases as M/G/1 queues with arbitrarily distributed service times, while [13,20] 
use networks of M/H/1 systems with 2-phase hyper-exponentially distributed service 
times to assign different exponentially distributed service times to read-only 
transactions and updates. Nevertheless, such models do not allow the evaluation of 
real-world systems with more than two transaction types. 

Most analytical performance studies model the communication network by an 
infinite server that introduces a constant delay for each message, regardless of
message size or network load [14,23,13,21,16,20]. "Infinite" means unlimited
transmission capacity, no queueing of messages, and the network is never considered 
to be the bottleneck. Such models would predict that replication always deteriorates 
throughput but never increases it, which we will disprove in section 5. In modern, 
large wide area applications and in wireless and mobile information systems that 
suffer from low bandwidth, the communication links may indeed become a 
bottleneck. However, very few analytical studies tried to combine a detailed database 
model with a detailed communication model [2,26]. Most authors also assume 
uniformly distributed data access, i.e. each data item is accessed with equal 
probability [14,28,10,5,23,18, 29,21,20,3]. Non-uniform data access is more realistic 
but modeled in very few analytical studies [31,16]. Lock conflicts and blocking of 
transactions are usually only modeled to compare concurrency control algorithms 
[14,7,10,18,21]. Such models are of considerable complexity. They typically have to 
use simulations and simplified modeling assumptions concerning replication and 
communication. 



2.2 Classification of Replication Models 

Some performance studies of distributed databases simply assume no replication, i.e.
each logical data item is represented by exactly one physical copy [28,21]. Other
models consider full or 1-dimensional partial replication:

(1) All objects to all sites (full replication) 

Most performance evaluations assume full replication [11,14,4,24,23,18,29,27,3,17] 
i.e. all data objects are replicated to all sites so that each site holds a complete copy of 
the distributed database. This is an extreme case of replication and it has been 
recognized that for most applications neither full nor no replication is the optimal 
configuration [10,13,1]. 

(2) All objects to some sites (1-dimensional partial replication)

Several studies model partial replication such that each data object is
replicated to some of the sites [7,10,16,31,20]. Formally, the degree of replication can
be denoted by a parameter $r \in \{1,2,\ldots,n\}$, describing that each logical data item is
represented by r physical copies, where n is the number of sites. r = 1 expresses no
replication, r = n means full replication, and if r > 1, every data item is replicated.
Consequently, either no or all data items are replicated. In many applications there is
update-intensive data which should be replicated to very few sites while read 
intensive data should be replicated to many sites. This common situation cannot be 
modeled with the all-objects-to-some-sites scheme. 

(3) Some objects to all sites (1-dimensional partial replication)

Alternatively, the degree of replication can be denoted by $r \in [0;1]$ describing the
percentage of logical data items that are fully replicated to all sites. A data item is
either fully replicated or not replicated at all. r = 0 expresses no replication, r = 1
means full replication. To the best of our knowledge, this model of partial replication
has only been considered in [2,13,1]. The some-objects-to-all-sites scheme is
orthogonal to the all-objects-to-some-sites approach since the degree of replication is
defined along the percentage of data items as opposed to the number of copies.
However, in large wide area distributed databases it is hardly affordable to replicate 
some data items to all sites (causing considerable update propagation overhead) and 
others to none (reducing their availability severely). Thus, the some-objects-to-all- 
sites scheme is not realistic. 



3 A 2D-Replication Model with Integrated Communication (2RC) 

Learning from the existing models, we define the requirements for an improved 
modeling approach: 

A more expressive model of replication is needed to represent and evaluate 
realistic replication schemata. Not only response time but also throughput and 
bottlenecks must be computable as important performance criteria. This requires
capturing load dependent communication delay, network limited transaction
throughput, and the interplay between replication and communication. Detailed 
transaction and communication patterns must be describable, so that real-world 
applications can be modeled. Non-uniform data access, the quality of replication 
schemata and relaxed coherency must also be considered. 

All these requirements are met by 2RC. Additionally, the model of transaction 
processing in 2RC follows the primary copy approach [30], since it has been judged 
advantageous over other replica control concepts [15,18] and is implemented in 
commercial systems like Sybase and Oracle. In 2RC, asynchronous update 
propagation to the secondary copies is modeled, i.e. 2-phase-commit processing of 
updates is not considered, and transactions are assumed to be executed at a single site, 
either the local or a remote site. Since 2RC is not primarily intended to compare 
concurrency control algorithms, it refrains from modeling lock conflicts to allow for 
more details in the replication and communication submodels. 



3.1 The 2-Dimensional Model of Replication 

Based on the classification of section 2.2, the two orthogonal 1-dimensional concepts
are combined into a new 2-dimensional scheme called "Some objects to some sites". In
this scheme, replication is modeled by a pair $(r_1, r_2) \in [0;1] \times \{2,\ldots,n\}$ such that
$r_1 \in [0;1]$ describes the percentage of logical data items which are represented by $r_2$
physical copies each, i.e. they are replicated to $r_2$ of the n sites. A share of $1 - r_1$ of the
logical data items remains unreplicated. $r_1 = 0$ expresses no replication, $(r_1 = 1, r_2 = n)$
means full replication.

For d logical data items, a replication schema $(r_1, r_2)$ increases the number of
physical copies from d (no replication) to $(r_1 \cdot d \cdot r_2) + (d \cdot (1 - r_1))$. Viewing the
number of copies of replicated objects $(r_1 \cdot d \cdot r_2)$ as the actual extent of replication, we
express it (for visualization and further calculations) independently of d and
normalized to the interval [0;1] as an overall level of replication, yielding $(r_1 \cdot r_2)/n$.
This characteristic of the 2D-approach is depicted in Fig. 1 for n = 50 sites.
Performance models which assume full replication only consider the point $(1, n)$ as the
possible replication schema. The all-objects-to-some-sites scheme analyses
replication along the bold line from point (1,50) to (1,0) only. The orthogonal some-
objects-to-all-sites scheme studies replication along the line from (1,50) to (0,50)
only. However, the 2D-scheme considers any point in Fig. 1 a possible replication
schema so that real-world replication strategies can be captured more accurately.
Thus, we believe that the 2D-model is a profitable contribution towards a better
understanding of how replication affects distributed system performance.
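
The two quantities introduced in this section can be checked with a few lines of code. The following is only a minimal sketch: the function names are hypothetical, and it assumes the readings $(r_1 \cdot d \cdot r_2) + d \cdot (1 - r_1)$ for the number of physical copies and $(r_1 \cdot r_2)/n$ for the overall level of replication.

```python
def physical_copies(d: int, r1: float, r2: int) -> float:
    """Physical copies for d logical items under the schema (r1, r2).

    A share r1 of the items has r2 copies each; the rest has one copy.
    """
    return r1 * d * r2 + d * (1.0 - r1)


def replication_level(r1: float, r2: int, n: int) -> float:
    """Overall level of replication (r1 * r2) / n, normalized to [0, 1]."""
    return (r1 * r2) / n


if __name__ == "__main__":
    n, d = 50, 1000  # illustrative values only
    for r1, r2 in [(1.0, 50), (1.0, 10), (0.3, 50), (0.5, 30)]:
        print(r1, r2, physical_copies(d, r1, r2), replication_level(r1, r2, n))
```

Schemata such as (0.3, 50) and (0.5, 30) yield the same overall level of 0.3 although they distribute copies very differently, which is the distinction a 1-dimensional model cannot express.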




Fig. 1. The 2-dimensional model of replication 



3.2 Dependency Structure 

Apart from how the various aspects of a real system are modeled, it also matters
which dependencies between them are considered in the model. Fig. 2 sketches the
structure of dependencies we consider in 2RC. Rectangular nodes represent input
parameters and modeling assumptions, oval nodes stand for intermediate or final
results, the latter in bold. An arrow from a node A to a node B indicates that B
depends directly on A. The 2D-replication scheme is a core part of our model and has
direct impact on the quality of replication, the arrival rates and the network traffic,
and thus substantial influence on all further results. $\tau$ transaction types (with different
arrival rates and service time distributions) and 2 message types per transaction (with
individually distributed transmission times depending on the message size) allow a
wide range of different applications and workload patterns to be modeled. The two bold
arrows highlight the important dependencies through which load dependent
communication delay and network limited throughput are captured. The overall
throughput depends on both the network and the local database throughput, which
allows a bottleneck analysis.

Fig. 2. Dependency structure of 2RC



For comparison, Fig. 3 shows the dependencies captured in [16,2] which we found 
typical for many analytical models. The model in [16] assumes constant 
communication delay and unlimited network capacity. On the database side, it
considers 1D-replication and does not allow different types of queries or updates (like all
models we examined). The model in [2] captures load dependent communication
delay, but this is not exploited for throughput calculation and is combined with a simple
replication and transaction model. In [16] and [2] the arrival rates depend on non- 
uniform data access, modeled by a factor for access locality and hot spot access 
respectively, while such dependencies are neglected in most other studies. Almost all 
analytical studies fail to cover sufficient dependencies between transaction processing 
and communication, so that they do not calculate throughput as a performance 
criterion. In conclusion, 2RC combines a comprehensive database and replication 
model with a detailed communication model, and covers more interdependencies 
between the two than previous studies. 



4 Analytical Queueing Model for 2RC 

This section presents the mathematical implementation of a queueing model of a
replicated database according to the 2RC approach. Table 1 shows the model
parameters.

Fig. 3. Dependencies in Hwang et al. 96 [16] and Alonso et al. 90 [2].

Parameter               Meaning
$n$                     Number of sites
$\tau$                  Number of transaction types
$a_i$                   Percentage of transactions of type i
$q_i$                   Function to distinguish between queries and updates
$\lambda_i$             Transaction arrival rate per site (TPS) of transaction type i
$r_1$                   Percentage of data items replicated
$r_2$                   No. of copies of replicated data items
$t_i$                   Mean service time for a transaction of type i (sec)
$k$                     Coherency index
$baud$                  Communication bandwidth (bps)
$loc_i$                 Transactions' locality without replication
$plcmt_i$               Quality of replica placement
$sel_i$                 Quality of replica selection
$f\_plcmt_i$            Fine tuning replica placement
$f\_sel_i$              Fine tuning replica selection
$\ell_i$                Probability of local transaction execution
$size_C^{send_i}$       Message size for a send of a transaction of type i (byte)
$size_C^{return_i}$     Message size of returned query results (byte)
$t_C^{send_i}$          Mean time to send a transaction of type i (sec)
$t_C^{return_i}$        Mean time to return query results (sec)

Table 1. Model parameters.



4.1 Workload 

We model a replicated database by an open queueing network in which each of the n
identical sites is represented by an $M/H_\tau/1$ system. Transaction arrivals to the
distributed system are modeled by n identical Poisson processes with parameter $\lambda$, i.e.
one arrival stream per site such that the transaction load increases with the number of
sites. The $\tau$ different types of transactions are numbered $1, 2, \ldots, \tau$. A transaction is
of type i with probability $a_i$, where $\sum_{i=1}^{\tau} a_i = 1$, so the Poisson arrival process at a
site consists of $\tau$ separate streams having rate $\lambda_i = a_i \cdot \lambda$, $1 \le i \le \tau$. A characteristic
function q distinguishes between queries and updates:

$$q_i = \begin{cases} 1, & \text{if transactions of type } i \text{ are queries} \\ 0, & \text{if transactions of type } i \text{ are updates} \end{cases}$$

The number of data objects accessed per transaction is assumed to be geometrically
distributed, so the service time for a transaction of type i at a local database is
modeled as exponentially distributed with mean $t_i$ (seconds) [28]. Thus, the service
time for the combined arrival process of all $\tau$ transaction types follows a $\tau$-phase
hyperexponential distribution.
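
As an illustration of the workload model, the sketch below splits the site arrival rate into per-type streams and computes the first two moments of a hyperexponential service time. The function names are hypothetical; the only facts used are that type i arrives with rate $a_i \cdot \lambda$ and that an exponential phase with mean $t_i$ has second moment $2 t_i^2$.

```python
from typing import List, Sequence, Tuple


def type_rates(lam: float, a: Sequence[float]) -> List[float]:
    """Split a Poisson stream of rate lam into per-type rates a_i * lam."""
    if abs(sum(a) - 1.0) > 1e-9:
        raise ValueError("type probabilities a_i must sum to 1")
    return [a_i * lam for a_i in a]


def hyperexp_moments(p: Sequence[float], t: Sequence[float]) -> Tuple[float, float]:
    """First and second moment of a hyperexponential distribution.

    Phase i is chosen with probability p_i and is exponential with mean t_i,
    so it contributes p_i * t_i and p_i * 2 * t_i**2 to the two moments.
    """
    m1 = sum(p_i * t_i for p_i, t_i in zip(p, t))
    m2 = sum(p_i * 2.0 * t_i ** 2 for p_i, t_i in zip(p, t))
    return m1, m2


if __name__ == "__main__":
    # Illustrative values only: short queries, updates, long queries.
    print(type_rates(10.0, [0.7, 0.2, 0.1]))
    print(hyperexp_moments([0.7, 0.2, 0.1], [0.05, 0.1, 1.0]))
```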



4.2 Locality and Quality of Replication 

Without replication but due to skillful data allocation, transactions exhibit a behaviour
of locality, i.e. they tend to access data items locally available at their site of
submission. This is modeled by the probability $loc_i \in [0;1]$ ($1 \le i \le \tau$) that a
transaction of type i can be executed at the local site, while it has to be forwarded to a
remote site with probability $1 - loc_i$. Introducing partial replication then
increases the probability that a query can be answered locally by $(r_1 \cdot r_2)/n$. Due to the
primary copy approach, the write availability does not increase.

The selection of data items to replicate and the decision where to place them can 
have significant impact on the overall system performance [32]. Thus, we consider 
this quality of a replication design in the performance evaluation. We find that replica 
selection has major impact on updates, while replica placement is more significant for 
query processing: 

Updates: Selecting many update intensive data items for replication causes high
update propagation overhead. Therefore we introduce a "selection parameter"
$sel_i \in [0; 1/r_1]$ for $1 \le i \le \tau$ and q(i) = 0, expressing to which extent updates of type i
tend to access data items that were selected for replication. $sel_i = 0$ means that the
replicated data items are never changed by updates of type i. $sel_i = 1$ signals no
preference towards replicated or unreplicated data, and $sel_i = 1/r_1$ declares that
replica selection is so unfortunate that updates of type i always access replicated data.

Queries: Even if all read intensive data items are replicated, performance gains are
quite low as long as these replicas are not available locally to the queries. Thus, a
"placement parameter" $plcmt_i \in [1; n/(r_1 \cdot r_2)]$ for $1 \le i \le \tau$ and q(i) = 1 expresses to
which extent replica placement increases the probability that a query of type i can
be executed locally. $plcmt_i = 1$ means that replica placement does not necessarily
increase the queries' locality, and $plcmt_i = n/(r_1 \cdot r_2)$ declares that queries of type i can
always run at their home sites.



The higher the degree of replication the more likely it is that not only update
intensive data items are selected for replication, and the more likely it is that
replication increases the local read probability. Therefore, the parameters $sel_i$ and
$plcmt_i$ have to depend on the degree of replication and are defined as

$$plcmt_i = \left(\frac{n}{r_1 \cdot r_2}\right)^{f\_plcmt_i} \quad \text{and} \quad sel_i = \left(\frac{1}{r_1}\right)^{f\_sel_i}$$

so that the parameters $f\_plcmt_i \in [0,1]$ and $f\_sel_i \in [-\infty,1]$ can "fine-tune" how far or
close $plcmt_i$ and $sel_i$ follow their optimum values $n/(r_1 \cdot r_2)$ and $1/r_1$ respectively.

We assume that the replicas are distributed evenly across the sites, so that each site
receives an equal share of forwarded transactions and propagated updates. Thus, the
overall probability $\ell_i$ that a transaction of type i ($1 \le i \le \tau$) can be executed at its
local site amounts to

$$\ell_i = loc_i + q_i \cdot (1 - loc_i) \cdot \frac{r_1 \cdot r_2}{n} \cdot plcmt_i$$

because without replication a transaction is processed locally with probability $loc_i$
(first term). With probability $1 - loc_i$ queries ($q_i = 1$) cannot be executed locally, but
replication increases the local read availability by $(r_1 \cdot r_2)/n$ or higher, depending on
the replica placement (second term). Note that $loc_i$, $sel_i$ and $plcmt_i$ model non-
uniform data access.
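
The quality parameters and the resulting local execution probability can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, and the code assumes the definitions $plcmt_i = (n/(r_1 \cdot r_2))^{f\_plcmt_i}$, $sel_i = (1/r_1)^{f\_sel_i}$ and $\ell_i = loc_i + q_i \cdot (1 - loc_i) \cdot (r_1 \cdot r_2)/n \cdot plcmt_i$, with $r_1 > 0$.

```python
def plcmt(n: int, r1: float, r2: int, f_plcmt: float) -> float:
    """Placement quality: 1 for f_plcmt = 0, n / (r1 * r2) for f_plcmt = 1."""
    if r1 <= 0.0:
        raise ValueError("r1 must be positive (no replication has no placement)")
    return (n / (r1 * r2)) ** f_plcmt


def sel(r1: float, f_sel: float) -> float:
    """Selection parameter: 1 for f_sel = 0 (unbiased), 1 / r1 for f_sel = 1."""
    if r1 <= 0.0:
        raise ValueError("r1 must be positive")
    return (1.0 / r1) ** f_sel


def local_prob(loc: float, q: int, n: int, r1: float, r2: int, plc: float) -> float:
    """Probability that a transaction runs at its home site.

    Only queries (q = 1) benefit from replicas; updates (q = 0) keep loc.
    """
    return loc + q * (1.0 - loc) * (r1 * r2 / n) * plc


if __name__ == "__main__":
    n, r1, r2 = 50, 0.3, 50  # illustrative schema
    for f in (0.0, 0.5, 1.0):
        p = plcmt(n, r1, r2, f)
        print(f, p, local_prob(0.5, 1, n, r1, r2, p))
```

With perfect placement ($f\_plcmt_i = 1$) the second term becomes $1 - loc_i$, so every query runs locally, which matches the interpretation of the upper bound $n/(r_1 \cdot r_2)$ given above.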



4.3 Transaction Processing and Arrival Rates 

The performance of replicated databases can be improved if the requirement of 
mutual consistency among the replicas of a logical data item is relaxed. Various 
concepts of relaxed coherency can be denoted by coherency conditions from which
a coherency index $k \in [0;1]$ can be calculated as a measure of the degree of allowed
divergence [13]. Small values of k express high relaxation, k = 0 models suspended
update propagation, and for k = 1 updates are propagated immediately.

For $1 \le i \le \tau$, the total arrival rate $\lambda_i^{total}$ of transactions of type i at a single site
amounts to

$$\lambda_i^{total} = \ell_i \cdot \lambda_i + (n-1) \cdot (1-\ell_i) \cdot \lambda_i \cdot \frac{1}{n-1} + (1-q_i) \cdot (n-1) \cdot (r_1 \cdot sel_i) \cdot \frac{r_2-1}{n-1} \cdot k \cdot \lambda_i$$

because a share of $\ell_i$ of the incoming $\lambda_i$ transactions can be executed locally (first
term) whereas the remaining $(1-\ell_i) \cdot \lambda_i$ transactions are forwarded to sites where
appropriate data is available. The other n-1 sites also forward $(1-\ell_i)$ of their $\lambda_i$
transactions, which are received by each of the remaining databases with equal
probability 1/(n-1). This explains the second term. The third term covers update
propagation and its possible reduction through relaxed coherency: If transactions of
type i are updates (i.e. 1 - q(i) = 1) the local load is increased by update
propagation from the n-1 remote sites. The probability that an update at one of the n-1
remote sites hits a primary copy which is replicated is $r_1 \cdot sel_i$. The probability that
one of the corresponding secondary copies resides at the local database is $(r_2-1)/(n-1)$
because the $r_2-1$ secondary copies are distributed evenly over n-1 sites. Finally,
update propagation may be reduced by relaxed coherency, i.e. if k < 1. The above
formula simplifies to

$$\lambda_i^{total} = \lambda_i + (1-q_i) \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot \lambda_i$$

and we define $\lambda^{total} := \sum_{i=1}^{\tau} \lambda_i^{total}$.
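
The simplification can be verified numerically. The sketch below evaluates both the three-term form (local share, forwarded share, propagated updates) and the simplified form; the function and argument names are hypothetical, and the three-term form is our reading of the verbal description in this section.

```python
def total_rate_full(lam_i: float, l_i: float, q_i: int, n: int,
                    r1: float, r2: int, sel_i: float, k: float) -> float:
    """Total arrival rate of type i at one site, written term by term."""
    local = l_i * lam_i                                    # executed at home
    forwarded = (n - 1) * (1.0 - l_i) * lam_i / (n - 1)    # received from others
    propagated = ((1 - q_i) * (n - 1) * (r1 * sel_i)
                  * ((r2 - 1) / (n - 1)) * k * lam_i)      # secondary-copy updates
    return local + forwarded + propagated


def total_rate(lam_i: float, q_i: int, r1: float, r2: int,
               sel_i: float, k: float) -> float:
    """Simplified form: lam_i + (1 - q_i) * (r1 * sel_i) * (r2 - 1) * k * lam_i."""
    return lam_i + (1 - q_i) * (r1 * sel_i) * (r2 - 1) * k * lam_i


if __name__ == "__main__":
    # Illustrative update stream: 2 TPS, half of the items replicated 11 times.
    print(total_rate_full(2.0, 0.6, 0, 50, 0.5, 11, 1.0, 1.0))
    print(total_rate(2.0, 0, 0.5, 11, 1.0, 1.0))
```

Queries ($q_i = 1$) leave the per-site load unchanged, whereas every update stream is amplified by the factor $1 + (r_1 \cdot sel_i) \cdot (r_2 - 1) \cdot k$.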



4.4 Intersite Communication 



Two messages are required to execute a transaction at a remote site: a send and a
return, e.g. a query is sent to a site and the result is returned. For each transaction type
i the communication delay for a send (return) is modeled to be exponentially
distributed with mean $t_C^{send_i}$ ($t_C^{return_i}$) seconds ($1 \le i \le \tau$). These values mainly
depend on the bandwidth and the message size. Therefore, the parameter baud
represents the network's bandwidth in bps, and $size_C^{send_i}$ ($size_C^{return_i}$) denotes the
message size in bytes for a send (return) of a transaction of type i. The means $t_C^{send_i}$
and $t_C^{return_i}$ characterizing the exponentially distributed service times are hence
defined as

$$t_C^{send_i} = \frac{8 \cdot size_C^{send_i}}{baud} \quad \text{and} \quad t_C^{return_i} = \frac{8 \cdot size_C^{return_i}}{baud}$$

The average number of messages per second in the distributed system amounts to

$$N = \sum_{i=1}^{\tau} \left( n \cdot (1-\ell_i) \cdot \lambda_i + (1-q_i) \cdot n \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot \lambda_i \right) + \sum_{i=1}^{\tau} q_i \cdot n \cdot (1-\ell_i) \cdot \lambda_i \qquad (*)$$

The first sum covers messages of type send (transactions forwarded to remote sites
due to a lack of appropriate local data, and update propagation); the second sum covers
returned query results. Remote updates are assumed not to be acknowledged and thus
do not cause return messages. To model limited network capacity, the local databases
are considered to be connected to the network via local communication servers
modeled as $M/H_{2\tau}/1$ systems. The arrival rate at any such server is N/n messages
per second because each site sends and receives the same number of messages due to
the sites' identical layout and symmetrical behaviour. The service time follows an
$H_{2\tau}$ distribution, because $\tau$ transaction types imply $2\tau$ different message types: $\tau$
message types have an exponentially distributed service time with mean $t_C^{send_i}$ and $\tau$
message types with mean $t_C^{return_i}$ ($1 \le i \le \tau$). The expression (*) implies that a share
of $(n \cdot (1-\ell_i) \cdot \lambda_i + (1-q_i) \cdot n \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot \lambda_i)/N$ of the N messages has a
mean service time $t_C^{send_i}$ (for $1 \le i \le \tau$) and a share of $(q_i \cdot n \cdot (1-\ell_i) \cdot \lambda_i)/N$ of the
messages has a mean service time $t_C^{return_i}$ (for $1 \le i \le \tau$). Hence, the first and second
moment of the service time distribution can be derived following [19].

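Subsequently, the average waiting time at a local communication server can be
obtained using the Pollaczek-Khinchin formula for general M/G/1 systems [19] and
results in

$$W_C = \frac{\sum_{i=1}^{\tau} \left[ \left( (1-\ell_i) \cdot \lambda_i + (1-q_i) \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot \lambda_i \right) \cdot (t_C^{send_i})^2 + q_i \cdot (1-\ell_i) \cdot \lambda_i \cdot (t_C^{return_i})^2 \right]}{1 - \sum_{j=1}^{\tau} \left[ \left( (1-\ell_j) \cdot \lambda_j + (1-q_j) \cdot (r_1 \cdot sel_j) \cdot (r_2-1) \cdot k \cdot \lambda_j \right) \cdot t_C^{send_j} + q_j \cdot (1-\ell_j) \cdot \lambda_j \cdot t_C^{return_j} \right]}$$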
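
The waiting time at a communication server can be sketched directly from the Pollaczek-Khinchin formula $W = \lambda \cdot E[S^2] / (2 \cdot (1 - \rho))$. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, transmission times are taken as 8 · size / baud (bytes converted to bits), and the per-site message rates are those described in the text (forwarded transactions and propagated updates as sends, query results as returns).

```python
from typing import Sequence


def transmission_time(size_bytes: float, baud: float) -> float:
    """Mean transmission time in seconds for a message of size_bytes at baud bps."""
    return 8.0 * size_bytes / baud


def comm_wait(lam: float, a: Sequence[float], q: Sequence[int], l: Sequence[float],
              sel: Sequence[float], r1: float, r2: int, k: float,
              t_send: Sequence[float], t_ret: Sequence[float]) -> float:
    """Mean waiting time at one communication server (M/G/1, Pollaczek-Khinchin).

    Each exponential message class with rate x and mean t adds x * t**2 to
    lambda * E[S^2] / 2 and x * t to the utilization rho.
    """
    half_second_moment = 0.0
    rho = 0.0
    for i in range(len(a)):
        lam_i = a[i] * lam
        send_rate = ((1.0 - l[i]) * lam_i
                     + (1 - q[i]) * (r1 * sel[i]) * (r2 - 1) * k * lam_i)
        ret_rate = q[i] * (1.0 - l[i]) * lam_i
        half_second_moment += send_rate * t_send[i] ** 2 + ret_rate * t_ret[i] ** 2
        rho += send_rate * t_send[i] + ret_rate * t_ret[i]
    if rho >= 1.0:
        raise ValueError("communication server is saturated (rho >= 1)")
    return half_second_moment / (1.0 - rho)


if __name__ == "__main__":
    # One query type, half of the queries remote; illustrative values only.
    print(comm_wait(10.0, [1.0], [1], [0.5], [1.0], 0.3, 50, 1.0, [0.01], [0.02]))
```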



4.5 Performance Criteria 



Similar to the calculation of $W_C$, the mean waiting time $W_D$ at a local database is
found to be

$$W_D = \frac{\sum_{i=1}^{\tau} \lambda_i^{total} \cdot t_i^2}{1 - \sum_{j=1}^{\tau} \lambda_j^{total} \cdot t_j}$$

with $\lambda_i^{total}$ denoting the total arrival rate of transactions of type i at a site (section 4.3),
so that the combined average response time over all transaction types results in

$$R = \sum_{i=1}^{\tau} a_i \cdot R_i$$

where

$$R_i = W_D + t_i + (1-\ell_i) \cdot \left( W_C + t_C^{send_i} + t_C^{return_i} \right)$$

is the response time for transactions of type i. On average a transaction (of type i)
needs to wait for $W_D$ seconds at a database to receive a service of $t_i$ seconds.
Additionally, with probability $(1-\ell_i)$ a transaction needs to be forwarded to a
remote site which takes $W_C$ seconds to wait for plus the time to be sent and returned.
Note that we assume $t_C^{return_i} = 0$ if q(i) = 0, i.e. no return messages for updates.
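
The response time calculation can be illustrated as follows. This sketch rests on two assumptions that should be checked against the original equations: the database waiting time is taken from the Pollaczek-Khinchin formula for a hyperexponential server fed by the per-type total arrival rates, and the combined response time weights each type by its probability $a_i$. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def db_wait(total_rates: Sequence[float], t: Sequence[float]) -> float:
    """Mean waiting time at a local database (M/G/1 with exponential phases).

    total_rates[i] is the total arrival rate of type i at the site, t[i] its
    mean service time in seconds.
    """
    rho = sum(r * t_i for r, t_i in zip(total_rates, t))
    if rho >= 1.0:
        raise ValueError("local database is saturated (rho >= 1)")
    return sum(r * t_i ** 2 for r, t_i in zip(total_rates, t)) / (1.0 - rho)


def response_times(a: Sequence[float], l: Sequence[float], t: Sequence[float],
                   w_d: float, w_c: float, t_send: Sequence[float],
                   t_ret: Sequence[float]) -> Tuple[float, List[float]]:
    """Return (combined response time, per-type response times R_i)."""
    per_type = [
        w_d + t[i] + (1.0 - l[i]) * (w_c + t_send[i] + t_ret[i])
        for i in range(len(a))
    ]
    combined = sum(a_i * r_i for a_i, r_i in zip(a, per_type))
    return combined, per_type


if __name__ == "__main__":
    w_d = db_wait([5.0], [0.1])  # a single M/M/1-like stream, illustrative only
    print(w_d, response_times([1.0], [0.8], [0.1], w_d, 0.01, [0.02], [0.03]))
```

For a single exponential class the database waiting time reduces to the familiar M/M/1 value $\rho / (\mu - \lambda)$, which gives a quick sanity check.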

In steady state, the throughput of the local databases equals the arrival rate $\lambda$ but is
bounded by the limited system capacity. Specifically, the throughput can grow until
either a local database server or a communication server (the network) is saturated,
i.e. its utilization ($\rho_D$ or $\rho_C$, respectively) reaches 1. Solving the equations $\rho_D = 1$ and
$\rho_C = 1$ for the arrival rate $\lambda$ yields the maximum database limited throughput $T_D$ and
the maximum communication/network limited throughput $T_C$ as

$$T_D = \frac{1}{\sum_{i=1}^{\tau} \left( a_i + (1-q_i) \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot a_i \right) \cdot t_i}$$

and

$$T_C = \frac{1}{\sum_{i=1}^{\tau} \left[ \left( (1-\ell_i) \cdot a_i + (1-q_i) \cdot (r_1 \cdot sel_i) \cdot (r_2-1) \cdot k \cdot a_i \right) \cdot t_C^{send_i} + q_i \cdot (1-\ell_i) \cdot a_i \cdot t_C^{return_i} \right]}$$

respectively. The maximum throughput at a database site is $T = \min(T_D, T_C)$ because
whichever server is saturated first (either the database or the communication server) is
the bottleneck and limits throughput. The overall system throughput amounts to $n \cdot T$.
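
The bottleneck analysis amounts to comparing two saturation points. The following sketch computes both limits by setting the respective utilization to 1 and solving for the arrival rate; the function name, argument order and returned labels are hypothetical, and the load terms follow our reading of the utilization expressions in this section.

```python
import math
from typing import Sequence, Tuple


def max_throughput(a: Sequence[float], q: Sequence[int], l: Sequence[float],
                   sel: Sequence[float], t: Sequence[float],
                   t_send: Sequence[float], t_ret: Sequence[float],
                   r1: float, r2: int, k: float) -> Tuple[float, float, float, str]:
    """Return (T, T_D, T_C, bottleneck) per site in transactions per second."""
    db_load = 0.0   # database utilization per unit of arrival rate
    net_load = 0.0  # communication server utilization per unit of arrival rate
    for i in range(len(a)):
        propagated = (1 - q[i]) * (r1 * sel[i]) * (r2 - 1) * k * a[i]
        db_load += (a[i] + propagated) * t[i]
        net_load += (((1.0 - l[i]) * a[i] + propagated) * t_send[i]
                     + q[i] * (1.0 - l[i]) * a[i] * t_ret[i])
    t_d = 1.0 / db_load
    t_c = 1.0 / net_load if net_load > 0.0 else math.inf
    bottleneck = "database" if t_d <= t_c else "network"
    return min(t_d, t_c), t_d, t_c, bottleneck


if __name__ == "__main__":
    # One query type with half of the queries executed remotely (illustrative).
    print(max_throughput([1.0], [1], [0.5], [1.0], [0.01], [0.02], [0.02],
                         0.3, 50, 1.0))
```

Multiplying the returned per-site value by n gives the overall system throughput.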



5 Results and Expressiveness of the 2RC-Approach

The results we present are based on parameter values which are carefully chosen after
measurements in database systems for telecom applications [12]. The values also agree
with [2,10]. For ease of presentation, we consider only 3 transaction types: short
queries, updates, and long queries (e.g. statistical evaluations).

The sensitivity analysis (omitted here for brevity) showed that parameter variations
affect the performance values in a reasonable way, i.e. the model is stable. Although we
will mention absolute values to refer to characteristics in the diagrams below, we
consider the general trends, shapes and expressiveness of the graphs as the primary
results. Naturally, the graphs depend on input values like the number of updates etc.
However, here we do not intend to compare results for many different parameter
combinations but rather to show that the 2RC approach allows a more expressive
analysis for any set of parameters.

Fig. 4 shows the maximum throughput $T(r_1,r_2) = \min(T_D,T_C)$ over the $r_1$-$r_2$-space. A
1D-replication model considers either the "$r_1 = 1$-edge" of the graph, or the
"$r_2 = 50$-edge". Either case merely expresses that the throughput increases with a
moderate degree of replication ($r_1 = 0.3$ or $r_2 = 10$) but decreases remarkably when
replication is medium or high. (Note: Models which assume unlimited network
capacity cannot foresee this increase, e.g. [13,16,30].) However, the 2D-model tells
us more: As long as less than 35% of the data items are replicated ($r_1 < 0.35$)
throughput can be maximized by placing copies on all sites ($r_2 = 50$), reaching its
highest peak of 320 TPS for $r_1 = 0.3$. If availability considerations require more data
items to be replicated (e.g. 50%), a medium number of copies yields the maximum
throughput (e.g. $r_2 = 30$ for $r_1 = 0.5$). When striving for very high availability, it is
worthwhile to consider that replicating 75% of the data items to 40 sites allows a
throughput twice as high as full replication. These results show the increased
expressiveness of the 2D-model over 1-dimensional approaches.



[Figure: surface plot "Maximum Throughput" over $r_1$ and $r_2$, with the edges "Some-objects-to-all-sites ($r_2 = 50$)" and "All-objects-to-some-sites ($r_1 = 1$)"; legend bands from 50-100 to 300-350 TPS]

Fig. 4. Maximum Overall System throughput in TPS



The result that throughput can be maximized by placing copies at all sites for
$r_1 < 0.35$ depends on the quality of replication. The input value $f\_sel_i = -0.8$ models a
fairly good replication schema in which mainly read intensive but very little update
intensive data items are selected for replication. With such a careful replica selection
the update propagation overhead does not increase dramatically as the number of
copies ($r_2$) is maximized to enhance local read availability. This relieves
communication and increases the overall throughput. For larger values of $r_1$ ($r_1 \to 1$)
it becomes increasingly difficult to avoid replication of update intensive data items
such that placing copies at all sites is not affordable anymore (cf. Fig. 5).
Consequently, Fig. 4 looks different for a bad replica selection ($f\_sel_i > 0$), i.e. the
throughput peak is found at $r_1 = 1$ for a low number of copies ($r_2 < 12$). The more data
items are replicated ($r_1 \to 1$) the more difficult it is to pick only update intensive
objects for replication such that the read availability increases. This is captured in the
model because $sel_i$ converges towards 1 (i.e. unbiased replica selection) for $r_1 \to 1$
even for bad settings of $f\_sel_i$.

The results presented throughout this section and the benefits of replication also
depend on the percentage of updates in the overall workload. With an increasing
percentage of updates the benefits of replication are gradually reduced and eventually
outweighed by its drawbacks (update propagation cost vs. local and parallel read
access) such that no replication ($r_1 = 0$ / $r_2 = 1$) yields the best performance.

Since the throughput calculation considered limited network capacity, a bottleneck
analysis is possible for any replication schema and can be read off Fig. 5: For low
replication ($r_1 < 0.3$ or $r_2 < 10$), remote data access saturates the network earlier than
update propagation saturates the local databases. For medium and high replication
($r_1 > 0.5$, $r_2 > 25$) more update propagation network traffic is generated, but at the
same time increased local data availability relieves the network, while the transaction
load to update secondary copies grows and causes the local databases to be the
throughput bottleneck.




Fig. 5. Bottleneck Analysis 



The combined average response time R is shown in Fig. 6. A moderate degree of
replication ($r_1 < 0.5$, $r_2 < 20$) leads to increased local and parallel data access and thus
reduces response time from over 1 to less than half a second. Extending replication
along both dimensions rapidly saturates the system with propagated updates which
outweigh the advantage of local read access and cause high response times. The 2D-
model reveals that low response times are still possible if replication is extended
along one dimension but kept moderate in the other. Replication schemata as different
as ($r_1 = 1$, $r_2 = 10$), ($r_1 = 0.3$, $r_2 = 50$) or ($r_1 = 0.5$, $r_2 = 30$) could be chosen to satisfy an
application's individual requirements regarding reliability and availability, while
retaining similarly low response times. 1-dimensional models of replication did not
allow this kind of examination.



To address scalability issues, Fig. 7 depicts the throughput as a function of $r_1$ and the
system size n. Note that the model considers one transaction arrival stream per site
such that the overall transaction load increases as the number of sites increases.
Furthermore, the number of logical data objects increases with the number of sites. As
an example, $r_2$ is set to 2n/3 for any value of n, i.e. a share $r_1$ of the logical data items is
replicated to 2/3 of the n sites. If 100% of the data items are replicated, the throughput
grows to about 100 TPS as the number of sites is increased to 30. Larger systems do
not achieve a significantly higher throughput because without relaxed coherency high
replication in large systems causes considerable update propagation overhead which
hinders scalability [15]. Reducing replication ($r_1 < 1$) gradually improves scalability.
If less than 40% of the data objects are replicated, far fewer update propagation
transactions have to be processed so that the databases are not the bottleneck anymore
(see Fig. 7) and throughput can grow to over 500 TPS for $r_1 = 0.25$.

[Figure: surface plot "Average Response Time"]

Fig. 6. Average overall response time $R(r_1, r_2)$

[Figure: surface plot "Throughput Bottleneck"]

Fig. 7. Overall throughput $T(r_1, n)$ with Bottleneck Analysis



As indicated in the discussion for Fig. 4, the way throughput and scalability can be
improved by means of replication depends critically on the selection of data items for
replication and the placement of their copies. In Fig. 8 we examine the impact of the
quality of replication on the scalability by considering a bad replication schema, i.e.
one in which (a) not only read intensive but also a considerable amount of update
intensive data items are replicated and (b) the copies are placed at randomly chosen
sites rather than at remote sites where they are read particularly often. This can be
modeled by the parameter settings $f\_plcmt_i = 0$ and $f\_sel_i = 0.6$. Such a replication
schema causes more replica maintenance overhead than benefits through local and
parallel read access. Since the transaction load and the number of secondary copies
grow with the number of sites, the local databases become saturated with propagated
updates as the system size increases (except for a very low degree of replication, cf.
Fig. 8). This drastically deteriorates throughput and scalability. Consequently, the
throughput increases as replication is reduced towards 0%, which means that no
replication is better than bad replication. However, replication might still be required
to meet the availability requirements, and results like Fig. 8 clarify the trade-off
involved. Some might consider 100 an unreasonably high number of database sites,
but the general findings represented by Fig. 7 and Fig. 8 do not drastically change if
the maximum number of nodes is reduced to 40 or 20 sites, except for absolute
values. Furthermore, depending upon the application, the number of sites may indeed
range from only a few sites to several hundred sites [3].



6 Summary and Outlook 

Based on an analysis of existing performance evaluations, we presented 2RC as an 
improved modeling approach for performance evaluations of distributed and 
replicated database systems. 2RC is of increased expressiveness through a new 2- 
dimensional replication model, an advanced database communication model, the 
possibility to model the quality of a replication schema, and the ability to consider 
arbitrary transaction and communication patterns of real-world applications. 
Particularly, 2RC reflects the close interplay between replica management and 
intersite communication, and allows for bottleneck considerations. The findings show 
how partial 2D-replication schemata, which have not been evaluated previously, 
affect response time, throughput and scalability. Although the results presented here 
illustrate only a limited number of the possible parameter variations, they demonstrate 
that 2RC allows more expressive performance evaluations than 1 -dimensional 
approaches with simplifying communication models. 
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Numerous parameter variations can be investigated in an analytical model as 
opposed to simulations or measurements, making full validation virtually impossible 
[2,16,26,6,20]. However, a partial validation of our model has been conducted by 
matching the results against extensive measurement data gathered in a telecom 
database prototype [12], Additionally, we are currently implementing a distributed 
database testbed to validate the analytical results more thoroughly. 

Abstract. Middleware tools are generally used to glue together distributed,
heterogeneous systems into a coherent composite whole. Unfortunately,
there is no clear conceptual framework in which to reason about
transactional correctness in such an environment. This paper is a first
attempt at developing such a framework. Unlike most existing systems,
where concurrent executions are controlled by a centralized scheduler,
we will assume that each element in the system has its own independent
scheduler receiving input from the schedulers of other elements and producing
output for the schedulers of yet other elements in the system. In
this paper we analyze basic configurations of such composite systems and
develop correctness criteria for each case. Moreover, we also show how
these ideas can be used to characterize and improve different transaction
models such as distributed transactions, sagas, and federated database
transactions.



1 Introduction 

Composite systems consist of several components interconnected by middleware.
Components provide services which are used as building blocks to define the
services of other components, as shown in fig. 1. This mode of operation is widely
used, for instance, in TP-Monitors or CORBA based systems. To achieve true
plug and play functionality, each component should have its own scheduler for 
concurrency control and recovery purposes. Unfortunately, most existing the- 
ory addresses only the case of schedulers with a single level of abstraction jS|. 
This is unnecessarily narrow: it hinders the autonomy of components and, by 
neglecting to take advantage of the available semantic information, also restricts 
the degree of parallelism. Except for open nested, multi-level transactions H3|, 
almost no attention has been devoted to the case where several schedulers are
interconnected with the output of one scheduler being used as the input to the 
next. In this paper, we address this problem by determining what information 
a scheduler must provide to another to guarantee global correctness while still 
preserving the autonomy of each scheduler. Based on multilevel |41 1 4j . nested |0|, 
and stack-composite transactions we develop a theory that allows composite 
systems to be understood w.r.t. correct concurrent access. Of all possible con- 
figurations, we consider here only a few important cases: stacks, forks, and joins 

[Figure: a client transaction Stock_Check(Item#13) reads QTY_AT_HAND and
PENDING_ORDERS and, if the quantity at hand is below the pending orders, calls
Place_Order. These services are provided by servers that issue embedded SQL
(SELECT on Stock_List and on Orders, UPDATE on Request_List) against resource
managers, here a DBMS and a document processing system.]

Fig. 1. An example transaction in a composite system



(see fig. El) and simple combinations of them. The idea is to use these as building 
blocks that can later help us to understand and model more complex cases. We 
see the contribution of our paper in providing a formal basis for correctness in 
these architectures. In addition, we show how several classical problems of trans- 
action management can be expressed and explained in a coherent manner using 
the proposed framework, without having to resort to separate models for each 
case. In particular, we prove this by showing that traditional distributed transac- 
tions, sagas, and federated databases (global and local transactions) are special 
cases of our model. The paper is organized as follows. Section 2 presents the
transaction model and introduces conflict consistency as our basic correctness
criterion. Section 3 discusses correct concurrent executions in stack, fork, and
join schedules and introduces a simple combination of forks and joins. Section 4
discusses distributed transactions, sagas and federated transactions within the
framework of composite schedulers. Finally, section 5 concludes the paper with
a brief summary of the results and future work.



2 Conflict Consistency 



In this section, we present our basic correctness criteria. These serve as a neces- 
sary preparation for the rest of the paper, where it becomes obvious how opera- 
tions of a scheduler act as transactions in other schedulers. In what follows, we 
assume familiarity with concurrency control theory ^ . 



Gustavo Alonso et al. 



2.1 The New Model 

When executing transactions, a scheduler restricts parallelism because it must, 
first, observe the order constraints between the operations of each transaction 
and, second, impose order constraints between conflicting operations of different 
transactions. The restriction in parallelism occurs because, in a conventional 
scheduler, ordered operations are executed sequentially. As shown in m, this 
is too restrictive. It is often possible to parallelize concurrent operations even 
when they conflict as long as the overall result is the same as if they were 
executed sequentially. This requires relaxing some of the ordering requirements
of traditional schedulers. In addition, when several schedulers are involved, a 
mechanism is needed to specify to a scheduler what is a correct execution from 
the point of view of the invoking scheduler. For these two purposes we use the 
notion of weak and strong orders (assumed to be transitively closed): 

Definition 1 (Strong and Weak Order). Let A and B denote any tasks
(actions, transactions).

- Sequential (strong) order: $A \ll B$, A has to complete before B starts.

- Unrestricted parallel execution: $A \parallel B$, A and B can execute concurrently,
equivalent to any order, i.e., $A \ll B$ or $B \ll A$.

- Restricted parallel (weak) order: $A < B$, A and B can be executed concurrently
but the net effect must be equivalent to executing $A \ll B$. □

From here, and independently of the notion of equivalence used, it follows that 
turning a weak order into a strong one leads to correct execution since this implies
sequential execution. In fact, traditional flat schedulers with only one level
of abstraction do just this: if two tasks conflict, they impose a strong order between
them. When moving to a composite system with several schedulers, it is
often possible to impose a weak order instead of a strong one, thereby increas- 
ing the possibility of parallelizing operations. Thus, the aim will be to impose 
either no orders or only weak orders while still preserving global correctness and 
characterize the exchange of information between schedulers in terms of these 
orders. Note that in our model, weak and strong orders are requirements and 
not observed (temporal) orders. A strong order implies a temporal one, but not 
the other way round. In figure |3 is is weakly input-ordered before t\, and the 
serialisation order is according to it. Therefore, the execution is correct. How- 
ever, ti is temporally ordered before t^ (the execution order is indicated by the 
position: left comes first). This shows that the temporal order may be irrelevant. 
In other words, order preservation as postulated in |5| is not required in our 
model. With these ideas, a transaction is now defined as follows. Let O be the 
set of all operations of a scheduler with which transactions can be formed. 

Definition 2 (Transaction). A transaction, t, is a triple $(O_t, <_t, \ll_t)$, where
$O_t$ is a set of operations taken from O, $<_t$ is a partial order on $O_t$ termed the
weak (intra-)transaction order, and $\ll_t$ is a partial order on $O_t$ termed the strong
(intra-)transaction order. For consistency we require $\ll_t \subseteq <_t$. □

Since we now have operations being executed at different schedulers, a modified
notion of conflict is needed. Let $CON \subseteq O \times O$ be a conflict predicate that
expresses whether two operation invocations commute. We say that two operations,
$o$ and $o'$, commute if there is no difference in return values of the sequence
$\alpha\, o\, o'\, \beta$ compared to $\alpha\, o'\, o\, \beta$ for all sequences $\alpha$ and $\beta$ with elements from
$O$ ini. Therefore, commutativity is not an absolute property but is relative to
the given set of (allowed) operation invocations. From here, we say that two
operations conflict if they do not commute, which also includes the case when it
is unknown whether they commute or not. In practice, in a composite system,
two operations conflict if there is a potential flow of information between them.
With this, we now have all the necessary elements to formally define a scheduler:

Definition 3 (Schedule). A schedule S is a five-tuple $(T, \to, \mapsto, <, \ll)$, where:

- T is a set of transactions.
Let O denote the set of all operations of T's transactions, i.e., $O = \bigcup_{t \in T} O_t$.

- $\to$ and $\mapsto$ are the weak and strong input orders, partial orders over T with
$\mapsto \subseteq \to$.

- $<$ and $\ll$ are the weak and strong output orders, partial orders over O such
that:

1. $\forall t, t' \in T, t \neq t', \forall o \in O_t, \forall o' \in O_{t'}, CON(o, o')$:

(a) $(t \to t') \Rightarrow (o < o')$

(b) $(t' \to t) \Rightarrow (o' < o)$

(c) otherwise: $(o < o') \lor (o' < o)$

2. (a) $\forall t \in T, \forall o, o' \in O_t : (o <_t o') \Rightarrow (o < o')$,

(b) $\forall t \in T, \forall o, o' \in O_t : (o \ll_t o') \Rightarrow (o \ll o')$,

3. Whenever $t \mapsto t'$, then $\forall o \in O_t, \forall o' \in O_{t'} : o \ll o'$,

4. $\ll \subseteq <$. □

According to point 1, a scheduler must weakly order every pair of conflicting
operations without contradicting the weak input order, if any, between the parent
transactions (otherwise, as we will see below, a cycle would immediately appear).
Point 2 ensures well-formedness, that is, all weak and strong transaction orders
are contained in the weak and strong output orders, respectively. Point 3
propagates the strong input order from the transactions to their operations, thereby
separating the execution tree of strongly ordered transactions. Point 4 guarantees
that the output orders of a scheduler are consistent between them. It is not
stated explicitly that if at least one of two weakly output ordered operations is
a leaf, then they are also strongly ordered. This is an evident requirement.
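
The notion of commutativity behind the conflict predicate CON can be illustrated with a small brute-force check. The sketch below is an assumption-laden illustration, not part of the paper: the operations (`inc10`, `inc6`, `double`, `read`) on a single integer are hypothetical, and the search over contexts $\alpha$ and $\beta$ is bounded by `max_len`, so a result of `True` only means that no difference was found within that bound.

```python
from itertools import product

# Hypothetical operations on one integer x. Each maps a state to
# (new_state, return_value). None of these names come from the paper.
OPS = {
    "inc10": lambda x: (x + 10, None),
    "inc6": lambda x: (x + 6, None),
    "double": lambda x: (x * 2, None),
    "read": lambda x: (x, x),
}


def return_values(sequence, state=0):
    """Execute the operations in order and collect their return values."""
    results = []
    for name in sequence:
        state, value = OPS[name](state)
        results.append(value)
    return results


def commute(o1, o2, allowed, max_len=2):
    """True if swapping o1 and o2 changes no return value for any contexts
    alpha, beta of length <= max_len built from the allowed operations."""
    contexts = [c for n in range(max_len + 1)
                for c in product(sorted(allowed), repeat=n)]
    for alpha, beta in product(contexts, repeat=2):
        k = len(alpha)
        left = return_values(alpha + (o1, o2) + beta)
        right = return_values(alpha + (o2, o1) + beta)
        # o1 sits at position k on the left and at k + 1 on the right.
        same = (left[:k] == right[:k]
                and left[k] == right[k + 1]
                and left[k + 1] == right[k]
                and left[k + 2:] == right[k + 2:])
        if not same:
            return False
    return True


def conflict(o1, o2, allowed):
    """Operations conflict if they do not commute."""
    return not commute(o1, o2, allowed)
```

With all four operations allowed, `inc10` and `double` conflict, because a later `read` observes the difference. If `read` is removed from the allowed set, no return value can reveal the swap and the two operations commute, which matches the remark that commutativity is relative to the set of allowed operation invocations.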

2.2 Conflict Consistency 

The distinction between the two orderings requires modifying the traditional
notion of correctness. We will assume, as usual, that a transaction executed in 
isolation is correct. 

Definition 4 (Serial Schedule).

A schedule S is serial if $\mapsto_S$ is a total order, i.e., $\forall t, t' \in T : (t \mapsto t') \lor (t' \mapsto t)$. □

Note that serial schedule does not mean that all operations are executed serially.
Operations within one transaction can be executed in parallel. In a serial schedule
we have $\mapsto\, =\, \to$. Following the classical approach, we can now define the new
correctness criterion, called conflict consistency.



Definition 5 (Conflict Consistency (CC)).

A schedule S is conflict consistent if there is a serial schedule $S_{ser}$ whose strong
input and weak output order contain the weak input and output order of S, resp.,
i.e., $(\to_S \subseteq \mapsto_{S_{ser}}) \land (<_S \subseteq <_{S_{ser}})$. □



The name conflict consistency expresses that the order of all conflicting opera- 
tions must be consistent with the weak input order, i.e., the serialisation order 
must not contradict the weak input order. This can be further formalized as 
follows: 



Definition 6 (Serialisation Graph $\rightsquigarrow_S$). Given a schedule S, its serialisation
graph, denoted $\rightsquigarrow_S$, is a transitive irreflexive binary relation on $T \times T$, in which
$t \rightsquigarrow_S t'$ is contained if $t \neq t'$ and $\exists o \in O_t, \exists o' \in O_{t'} : CON(o, o') \land (o < o')$. □



Theorem 1. A schedule S is conflict consistent iff the union of its weak input
order and its serialisation graph $(\rightsquigarrow_S \cup \to_S)$ is acyclic. □

This can be proven by constructing a total order containing the union of weak
order and serialisation graph, thereby defining a serial schedule with the same
weak output order (see appendix). This theorem does not take explicitly into
account the strong input order since it is contained in the weak one (see def. 3).
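
The acyclicity test of Theorem 1 can be sketched in a few lines. This is a minimal sketch under assumptions: the data layout (`ops_of`, `conflicts`, `weak_in`, `weak_out`) and all function names are hypothetical, strong orders are left out, and the serialisation graph is built without its transitive closure, which does not change whether the union is acyclic. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (before, after) pairs has no cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def serialisation_graph(ops_of, conflicts, weak_out):
    """Edges t -> t' (t != t') for which some conflicting pair o in O_t,
    o' in O_t' is weakly output-ordered as o < o'."""
    owner = {o: t for t, ops in ops_of.items() for o in ops}
    return {(owner[a], owner[b]) for a, b in weak_out
            if owner[a] != owner[b] and frozenset((a, b)) in conflicts}


def conflict_consistent(ops_of, conflicts, weak_in, weak_out):
    """Theorem 1: weak input order united with serialisation graph is acyclic."""
    graph = serialisation_graph(ops_of, conflicts, weak_out)
    return is_acyclic(set(weak_in) | graph)


# Hypothetical example: t1 reads and writes, t2 writes the same item.
OPS_OF = {"t1": {"r1", "w1"}, "t2": {"w2"}}
CONFLICTS = {frozenset(("r1", "w2")), frozenset(("w1", "w2"))}

# t2 is weakly input-ordered before t1 and is also serialised first.
ok = conflict_consistent(OPS_OF, CONFLICTS, {("t2", "t1")},
                         {("w2", "r1"), ("w2", "w1")})
# Same output order, but the input order demands t1 before t2.
contradicts_input = conflict_consistent(OPS_OF, CONFLICTS, {("t1", "t2")},
                                        {("w2", "r1"), ("w2", "w1")})
# No input order, but w2 is placed between r1 and w1.
interleaved = conflict_consistent(OPS_OF, CONFLICTS, set(),
                                  {("r1", "w2"), ("w2", "w1")})
```

Only the first execution is conflict consistent. The second contradicts the weak input order, and the third has a cycle inside the serialisation graph itself.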

Compared to earlier work, we simplified the definitions: we dropped the notion of
equivalent schedules and we included a clearer presentation of order constraints.

2.3 Recovery 

For a formal treatment of recovery, we use the unified theory for concurrency
control and recovery. This theory is based on an expanded schedule, which
is used to represent the transaction’s recovery operations explicitly, i.e., in every
schedule each abort is replaced by the corresponding inverse operations (“undos”)
in the appropriate order. It has been shown that ordering commits in
the order of conflicting operations and aborts in the opposite direction (SOT)
leads to correct concurrency control and recovery. In order to achieve
this, every scheduler must automatically generate the appropriate invocations of
inverse operations, properly ordered, in case of a failure. If inverse operations are
not available, we will assume that a scheduler defers the commit of all operations
that do not have inverses. A more comprehensive treatment of recovery is beyond
the scope of this paper. We will concentrate here on concurrency control.

Fig. 4. Stack.

Fig. 5. Fork.
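
The idea of an expanded schedule, in which each abort is replaced by inverse operations, can be sketched as follows. This is an illustration only: the list-of-pairs representation, the `inverse_of` table and the choice to undo a transaction's operations in reverse order of execution are assumptions, since the text only speaks of "the appropriate order".

```python
def expand(schedule, inverse_of):
    """Replace every abort by the inverses of that transaction's operations.

    schedule:   list of (transaction, operation) pairs; the operation
                "abort" marks an abort of the transaction.
    inverse_of: maps an operation name to the name of its inverse.
    The inverses are emitted in reverse order of the original operations
    (an assumption made for this sketch).
    """
    expanded = []
    history = {}
    for txn, op in schedule:
        if op == "abort":
            for done in reversed(history.pop(txn, [])):
                expanded.append((txn, inverse_of[done]))
        else:
            history.setdefault(txn, []).append(op)
            expanded.append((txn, op))
    return expanded


# Hypothetical operation names and inverses.
INVERSES = {"inc": "dec", "put": "remove"}
EXPANDED = expand(
    [("t1", "inc"), ("t2", "inc"), ("t1", "put"), ("t1", "abort")],
    INVERSES,
)
```

In the example, the abort of `t1` becomes `remove` followed by `dec`, while the operation of `t2` is left untouched.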



3 Composite Systems 

In a composite system, every server has a scheduler $S$ and provides a set $O_S$
of operation invocations to be used to build transactions (the server’s services),
i.e., an operation of a scheduler can be and often will be a transaction of another
one. Every scheduler $S$ has a commutativity specification expressed by the conflict
predicate $CON_S$. Every scheduler works locally ensuring correctness with
respect to its (local) $CON_S$. The question to address is how to guarantee global
correctness in such a scenario and what information is needed at each scheduler
to guarantee global correctness.



3.1 Stack Schedules 



Stack schedulers take the output of one scheduler and use it directly as input 
to the next (see fig. 4). Transactions can have different depths, but operations
at the same level are always processed by the same scheduler. This structure 
is a generalization of multi-level and nested transactions [ 1 ( 141311 ^ and it can 
be found in the internal structure of many systems. The notion of stack used 
in this paper is taken from 0. Here, in addition, we prove correctness of stack 
schedules. 



Definition 7 (Stack Schedule (SS)).

SS, an n-level stack schedule, consists of n schedules $S_1, \ldots, S_n$, such that, for
$1 < i \le n$:

• $T_{S_{i-1}} = O_{S_i}$ • $\to_{S_{i-1}} = <_{S_i}$ • $\mapsto_{S_{i-1}} = \ll_{S_i}$ □

This definition states that operations and their output orders are transactions
and input orders of the next lower schedule. The important aspect of definition 3
together with this definition is the way in which orders are propagated. Only 
the strong ordering is automatically propagated to all levels, thereby ensuring 
that if two transactions are strongly ordered, their execution trees will not be 
interleaved at any lower level. The weak order is propagated from one level to 
the next only if the operations involved conflict. If there is no weak input order 
among the parents, but two children operations conflict, the scheduler introduces 
a weak output order. 

Definition 8 (Stack Conflict Consistency (SCC)). An n-level stack schedule
SS is stack conflict consistent iff each individual schedule $S_i$ in SS is conflict
consistent, for $1 \le i \le n$. □

Theorem 2. A stack schedule is correct if it is SCC. □ 

We prove this theorem by constructing a serial execution of all transactions of 
all levels, i.e., on all levels all transactions are strongly ordered (see appendix). 

Notice that conflict consistency requires the existence of a serial schedule, 
where each transaction is executed serially, but not necessarily its operations. 
The advantage of SCC is that as long as each level independently enforces CC, 
the overall execution is correct. The weak order constraint and its careful prop- 
agation through the stack is important because often we do not know whether 
operations conflict or not. In these cases, one has to be careful and assume that 
they conflict. In contrast to existing multilevel transaction models 151141 . such 
a weak order constraint is irrelevant if there is no actual conflict at the next 
level. In it is shown that SCC is a larger class than order preserving seri- 
alisability |H] and level-by-level serialisability the two existing comparable 
criteria. Because we allow weak orders within transactions, SCC is also larger 
than multi-level-serialisability (MLSR in |Tl]b 
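
A level-by-level check of stack conflict consistency might look as follows. This is a sketch under assumptions: each level is a plain dictionary with hypothetical keys (`ops_of`, `conflicts`, `weak_in`, `weak_out`), only weak orders are modelled, and the linkage test mirrors the first two conditions of Definition 7 (the operations and weak output order of one level are the transactions and weak input order of the next lower level).

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (before, after) pairs has no cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def conflict_consistent(level):
    """Weak input order united with the serialisation graph must be acyclic."""
    owner = {o: t for t, ops in level["ops_of"].items() for o in ops}
    graph = {(owner[a], owner[b]) for a, b in level["weak_out"]
             if owner[a] != owner[b]
             and frozenset((a, b)) in level["conflicts"]}
    return is_acyclic(graph | level["weak_in"])


def stack_conflict_consistent(levels):
    """levels[0] is the top schedule, levels[-1] the bottom one."""
    for upper, lower in zip(levels, levels[1:]):
        upper_ops = set().union(*upper["ops_of"].values())
        if (set(lower["ops_of"]) != upper_ops
                or lower["weak_in"] != upper["weak_out"]):
            raise ValueError("levels are not linked as a stack schedule")
    return all(conflict_consistent(level) for level in levels)


# Hypothetical two-level stack: T1 invokes a and b, T2 invokes c.
TOP = {
    "ops_of": {"T1": {"a", "b"}, "T2": {"c"}},
    "conflicts": {frozenset(("a", "c"))},
    "weak_in": set(),
    "weak_out": {("a", "c")},
}


def bottom(weak_out):
    """Bottom level in which a, b, c are transactions with one leaf each."""
    return {
        "ops_of": {"a": {"a1"}, "b": {"b1"}, "c": {"c1"}},
        "conflicts": {frozenset(("a1", "c1"))},
        "weak_in": {("a", "c")},
        "weak_out": weak_out,
    }
```

If the bottom level orders `a1` before `c1`, both levels are conflict consistent. If it orders `c1` before `a1`, its serialisation graph contradicts the weak input order received from the top level, so the stack is rejected by a purely local test.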

3.2 Fork Schedules 

In a fork (fig. 5), the output of a schedule is used as input to several other
schedulers. Each pair top-level/lower-level scheduler can be seen as a stack and, 
indeed, it will follow the rules defined for stack schedulers. More formally: 

Definition 9 (Fork Schedule (FS)). A fork schedule FS consists of (n+1)
schedules $S_F, S_1, \ldots, S_n$, such that:

1. $O_{S_F} = \bigcup_{i=1}^{n} T_{S_i}$

3. $\forall (o_i, o_j), o_i \in O_{S_i}, o_j \in O_{S_j}, i \neq j$, we assume $o_i$ and $o_j$ commute. □

The output orders $<_{S_F}$ and $\ll_{S_F}$ of $S_F$ are passed to the related component $S_i$
as input orders. Note that every operation in $S_F$ is sent to only one scheduler
$S_i$ as a transaction. In “pure” forks we assume that invocation hierarchies are
completely separated. No information flows from $S_i$ to $S_j$ or vice versa. This
is why we assume that operations commute (point 3). Therefore, every weak
order between two operations going to different schedulers $S_i$ and $S_j$ disappears.
However, every weak order between two operations going to the same scheduler
$S_i$ is transformed to an input order in $S_i$, so this scheduler takes care of it. For
instance, the execution in figure 6 is a correct pure fork. Pure forks are very
common in practice, as shown in figure Q.

Fig. 6. Pure fork.

Fig. 7. Join.
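
How weak orders are routed in a pure fork can be sketched directly from the description above. The function and parameter names are hypothetical: `scheduler_of` records to which lower-level scheduler each operation of the fork scheduler is sent.

```python
def route_weak_orders(weak_out, scheduler_of):
    """Split the weak output order of the fork scheduler.

    weak_out:     set of (o, o') pairs, o weakly ordered before o'.
    scheduler_of: maps each operation to the one scheduler it is sent to.
    Returns (inputs, dropped): the weak input order handed to each
    lower-level scheduler, and the pairs that disappear because their
    operations go to different schedulers and are assumed to commute.
    """
    inputs = {s: set() for s in set(scheduler_of.values())}
    dropped = set()
    for a, b in weak_out:
        if scheduler_of[a] == scheduler_of[b]:
            inputs[scheduler_of[a]].add((a, b))
        else:
            dropped.add((a, b))
    return inputs, dropped


# Hypothetical fork: a and b are sent to S1, c is sent to S2.
INPUTS, DROPPED = route_weak_orders(
    {("a", "b"), ("a", "c")},
    {"a": "S1", "b": "S1", "c": "S2"},
)
```

Here `S1` receives the input order between `a` and `b`, while the order between `a` and `c` is dropped, since the two operations are executed in separate invocation hierarchies.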

Definition 10 (Fork Conflict Consistency (FCC)). A fork schedule FS
is fork conflict consistent (FCC), if the schedule $S_F$ is conflict consistent and
$\bigcup_{i=1}^{n} (\rightsquigarrow_{S_i} \cup \to_{S_i})$ is acyclic. □

Theorem 3. An execution in a fork schedule is correct iff it is FCC. □ 

This can be proven by constructing an equivalent stack schedule (see appendix) . 
FCC is a global criterion. Since we want to check for correctness locally, each 
scheduler should be able to decide independently if the schedule is correct or 
not. This can be done as follows: 

Theorem 4 (Criterion for Fork Conflict Consistency). A fork schedule
FS is fork conflict consistent, iff each of the schedules $S_F, S_1, \ldots, S_n$ is conflict
consistent. □

This can be proven by dividing $\bigcup_{i=1}^{n} (\rightsquigarrow_{S_i} \cup \to_{S_i})$ into subgraphs of the different
schedules and considering their connections to each other (see appendix).

Forks (fig. 5) are a straightforward extension of stacks. The criterion for forks
becomes more complicated when we allow that operation invocation hierarchies 
are not separated, i.e., when after the fork, it is possible to have a join schedule. 
This case is discussed in the next section. 



3.3 Join Schedules 

Fig. 8. Join with two transactions.

Fig. 9. Join with three transactions.

Join schedules entail several top schedulers whose outputs go to a common,
single low-level scheduler (figure 7), called $S_J$. (Join schedules are not to be
confused with join-transactions introduced in El.) Joins are common at the resource
manager level, where a single resource manager is being accessed by several
servers simultaneously. Again, each pair top-level/low-level schedule behaves like
a stack, but care must be taken to avoid inconsistencies. For this purpose, conflicts
must be assumed between operations of different higher level schedulers.
The formal definition of join schedule is a straightforward extension of that of
stacks: the output of the top-level schedulers (operations, and weak and strong
orders) is used as the input to the lower-level scheduler (abusing notation, also
called join schedule). Thus,



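Definition 11 (Join Schedule (JS)). A join schedule JS consists of (n+1)
schedules $S_J, S_1, \ldots, S_n$, such that:

• $\bigcup_{i=1}^{n} O_{S_i} = T_{S_J}$

• $\forall i \in \{1, \ldots, n\} : \forall t, t' \in O_{S_i}$: $t <_{S_i} t' \Rightarrow t \to_{S_J} t'$ and $t \ll_{S_i} t' \Rightarrow t \mapsto_{S_J} t'$ □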



Intuitively, since the lower level schedule follows the input orders provided, each 
pair top-level/lower-level scheduler behaves like a stack. From a correctness point 
of view, it remains to be determined whether and how to interleave transactions 
from different schedulers. The problems to avoid are illustrated in figures 8 and 9.

Assume two users, each one operating on one of two top-level schedulers in
a join. In figure 8 the information flow from $T_1$ to $T_2$ and back can be detected
neither by $S_J$ nor $S_1$ nor $S_2$. This problem could be solved if, for instance, $S_J$
would build the serialisation graph based on the root transactions, which would
result in a cycle. In figure 9, however, this does not help because the information
flows from $T_2$ to $T_3$ to $T_1$. The cycle can only be detected when considering the
weak input order between $T_1$ and $T_2$.

In other words, assume $T_1$ and $T_2$ were directly given to $S_J$ and not to $S_1$ and
$S_2$ as before. Then $S_J$ would directly reject the execution because it is not CC.
To capture these situations we introduce the ghost-graph. Let Act(T) represent
the children of transaction T at all levels below that in which T appears.



Definition 12 (Ghost-Graph for Join Schedules ($\lhd_{JS}$)).

$\forall (T, T')$ with $T \in T_{S_i}, T' \in T_{S_j}, i \neq j$, the ghost-graph $\lhd_{JS}$ is defined as:
$T \lhd_{JS} T'$ if there are children $t, t'$ of $T, T'$, resp., with $t, t' \in T_{S_J}$ and $t \rightsquigarrow_{S_J} t'$. □

Definition 13 (Join Conflict Consistency (JCC)). A join schedule JS is
join conflict consistent, if the schedule $S_J$ is conflict consistent and $\lhd_{JS} \cup
\bigcup_{i=1}^{n} (\rightsquigarrow_{S_i} \cup \to_{S_i})$ is acyclic. □

This definition matches the intuition that a join schedule should be considered 
correct if the whole stack schedule built from all partial schedules - together 
with the implicit (ghost) orderings between different schedules - is SCC. 
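
The ghost-graph and the acyclicity part of join conflict consistency can be sketched as follows. This is an illustration under assumptions: all names are hypothetical, the serialisation graph of the join scheduler is given directly as a set of edges, conflict consistency of the join scheduler itself is assumed to be checked separately, and the two scenarios are only loosely modelled on the situations described for the join figures, which are not reproduced here.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (before, after) pairs has no cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def ghost_graph(sg_join, parent_of, scheduler_of):
    """Lift serialisation edges of the join scheduler to root transactions
    that belong to different top-level schedulers."""
    return {(parent_of[t], parent_of[u]) for t, u in sg_join
            if scheduler_of[parent_of[t]] != scheduler_of[parent_of[u]]}


def jcc_union_acyclic(sg_join, parent_of, scheduler_of, top_level_edges):
    """Ghost-graph united with the weak input orders and serialisation
    graphs of the top-level schedulers must be acyclic."""
    ghost = ghost_graph(sg_join, parent_of, scheduler_of)
    return is_acyclic(ghost | set(top_level_edges))


# Two root transactions at different schedulers; T2's child is serialised
# between the two children of T1.
TWO = jcc_union_acyclic(
    sg_join={("inc10", "pct5"), ("pct5", "inc6")},
    parent_of={"inc10": "T1", "inc6": "T1", "pct5": "T2"},
    scheduler_of={"T1": "S1", "T2": "S2"},
    top_level_edges=set(),
)

# Three root transactions: information flows T2 -> T3 -> T1, and S1 has
# the weak input order T1 -> T2.
THREE_ARGS = dict(
    sg_join={("t2", "t3"), ("t3", "t1")},
    parent_of={"t1": "T1", "t2": "T2", "t3": "T3"},
    scheduler_of={"T1": "S1", "T2": "S1", "T3": "S2"},
)
THREE_WITHOUT_INPUT = jcc_union_acyclic(top_level_edges=set(), **THREE_ARGS)
THREE_WITH_INPUT = jcc_union_acyclic(top_level_edges={("T1", "T2")}, **THREE_ARGS)
```

In the first scenario the ghost-graph alone contains a cycle between `T1` and `T2`. In the second the ghost-graph is acyclic on its own, and the cycle only appears once the weak input order between `T1` and `T2` is added.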

Theorem 5. An execution in a join schedule is correct iff it is JCC. □ 

This can be proven by constructing an equivalent stack schedule that is SCC 
(see appendix). As pointed out above, the idea is that each scheduler will be 
able to make decisions locally. The criterion JCC, however, is a global property 
that cannot be enforced locally. Fortunately we can artificially generate weak 
input orders between all sub-transactions of transactions that come from differ- 
ent schedulers. The weak input order expresses the fact that we must assume a 
conflict between operations of such transactions. Remember that the weak in- 
put order generally stems from output orderings of conflicting operations at a 
level above. Since there is no common level above, we do not know about the 
commutativity of such operations and, therefore, must assume a conflict. This 
idea is formalized as follows: 

Definition 14 (Completed Join Schedule (CJS)). A completed join schedule
CJS is a JS with additional input order compatible with:

$\forall S_i, S_j, i \neq j : (\forall t \in T_{S_i}, \forall t' \in T_{S_j} : t \to t') \lor (\forall t \in T_{S_i}, \forall t' \in T_{S_j} : t' \to t)$ □

Theorem 6 (Criterion for JCC). A completed join schedule CJS is JCC, if
each of the schedules $S_J, S_1, \ldots, S_n$ is conflict consistent. □

This locally testable criterion for JCC can be proven by reducing the ghost-graph 
to weak input orders between transactions of different schedules (see appendix) . 

From here, if each transaction that arrives at $S_J$ contains the information
about its parent scheduler, $S_J$ can impose the additional weak input orderings.
Then $S_J$ can decide locally about correctness of the join schedule. In figures 8
and 9, $S_J$ would impose an order from inc(x,10) and inc(x,6) to inc(x,5%)
or vice versa. So, a cycle in the union of weak input order and serialisation graph
of $S_J$ would be detected.
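
The local test of a completed join schedule can be sketched as follows. The sketch makes several assumptions: all names are hypothetical, the parent schedulers are put into one total order (a special case of the pairwise choice in the definition of a completed join schedule), and the serialisation graph of the join scheduler is given directly as edges between its transactions. The operation names echo the inc operations mentioned in the text.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (before, after) pairs has no cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def completed_weak_input(parent_scheduler, scheduler_order):
    """Weak input orders between all transactions of different parents,
    directed according to the chosen order of the parent schedulers."""
    rank = {s: i for i, s in enumerate(scheduler_order)}
    return {(t, u) for t in parent_scheduler for u in parent_scheduler
            if rank[parent_scheduler[t]] < rank[parent_scheduler[u]]}


def join_accepts(sg_join, parent_scheduler, scheduler_order):
    """Local test at the join scheduler: added weak input orders united
    with its serialisation graph must be acyclic."""
    weak_in = completed_weak_input(parent_scheduler, scheduler_order)
    return is_acyclic(set(sg_join) | weak_in)


PARENT = {"inc10": "S1", "inc6": "S1", "pct5": "S2"}
# pct5 is serialised between the two increments of the other scheduler.
INTERLEAVED = {("inc10", "pct5"), ("pct5", "inc6")}
# Both increments are serialised before pct5.
SERIAL = {("inc10", "pct5"), ("inc6", "pct5")}
```

The interleaved execution is rejected for either direction of the imposed order, whereas the serial one is accepted when the order `S1` before `S2` is imposed.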

Note that there exist JCC join schedules that cannot be transformed into 
a correct CJS. Counterexamples are join schedules whose ghost-graph builds a 
cycle “between schedules” but not between transactions. CJS is a purely static 
notion and cannot be used in practice for dynamic scheduling. Dynamic schedul- 
ing in these scenarios is a complex problem which can be addressed by adding 
further restrictions to the execution but which is beyond the scope of this paper. 

3.4 FDBS-Schedules

The FDBS-schedule will be used later to describe federated databases. A federated
database can be described by a fork schedule with additional schedules at
the same level as $S_F$, one virtual schedule $S_{L_i}$ for each local transaction $t_{L_i}$ (see
figure 11):



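Definition 15 (FDBS-Schedule (FDS)). An FDBS-schedule FDS consists
of (n+l+1) schedules $S_F, S_{L_1}, \ldots, S_{L_l}, S_{J_1}, \ldots, S_{J_n}$, such that:

1. [formula not recoverable]

2. $\forall i \in \{1, \ldots, l\} : T_{S_{L_i}} = \{t_{L_i}\}, O_{t_{L_i}} = \{o_{L_i}\}$

3. $\forall i \in \{1, \ldots, n\} : \forall t, t' \in T_{S_{J_i}}$: $t <_{S_F} t' \Rightarrow t \to_{S_{J_i}} t'$ and $t \ll_{S_F} t' \Rightarrow t \mapsto_{S_{J_i}} t'$ □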



FDBS-schedules have the same problems as a join. Thus, to define correctness, 
we use a similar notion: 



Definition 16 (Ghost-Graph for FDBS-Schedules ($\lhd_{FDS}$)).

$\forall (T, T')$ with $T \in T_S, T' \in T_{S'}, S, S' \in \{S_F, S_{L_1}, \ldots, S_{L_l}\}, S \neq S'$, the
ghost-graph $\lhd_{FDS}$ is defined as:

$T \lhd_{FDS} T'$ if there are children $t, t'$ of $T, T'$, resp., with $t, t' \in T_{S_{J_i}}$ ($i \in \{1, \ldots, n\}$)
and $t \rightsquigarrow_{S_{J_i}} t'$. □

Not surprisingly, FDBS conflict consistency can be defined as join conflict con- 
sistency. 

Definition 17 (FDBS Conflict Consistency (FDCC)). An FDBS-schedule
FDS is FDBS conflict consistent, if the schedules $S_{J_1}, \ldots, S_{J_n}$ are conflict
consistent and $\lhd_{FDS} \cup (\rightsquigarrow_{S_F} \cup \to_{S_F})$ is acyclic. □



Theorem 7. An execution in an FDBS-schedule is correct iff it is FDCC. □ 

This can be proven by constructing an equivalent stack schedule (see appendix) . 
Note that serialisation graphs and weak input orders cannot appear in the “local
schedules” $S_{L_i}$ since they are only virtual schedules. Thus, since FDCC cannot
be applied locally we have to seek another criterion. For this, we define: 

Definition 18 (Completed FDBS-Schedule (CFDS)). A completed FDBS-schedule
CFDS is an FDS with additional conflicts: $CON_{S_F} := CON_{S_F} \cup
\{(o, o') \mid o \in O_t, o' \in O_{t'}, t \neq t', t, t' \in T_{S_F}, \exists i : (o, o' \in T_{S_{J_i}})\}$

Note that this implies that $S_F$ has to weakly order these extra conflicts, as
required by its CC property. □

Now, a CFDS is correct (FDCC), if the following holds: 

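Theorem 8 (Criterion for FDCC). A completed FDBS-schedule CFDS is
FDCC, if

(1) each of the schedules $S_F, S_{L_1}, \ldots, S_{L_l}, S_{J_1}, \ldots, S_{J_n}$ is conflict consistent and

(2) $\nexists t, t' \in O_T, T \in T_{S_F} : (t, t' \in T_{S_{J_j}}, j \in \{1, \ldots, n\}, \exists t_{L_1}, \ldots, t_{L_k} \in \bigcup_{i=1}^{l} T_{S_{L_i}} :
(t \rightsquigarrow_{S_{J_j}} t_{L_1} \rightsquigarrow_{S_{J_j}} \cdots \rightsquigarrow_{S_{J_j}} t_{L_k} \rightsquigarrow_{S_{J_j}} t'))$. □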

The last condition means that in any $S_{J_j}$ no local transactions must be serialised
between global (sub-)transactions of the same parent (= transaction in $S_F$). We
prove this theorem by assuming a cycle in the union of input, serialisation and
ghost-order and showing a contradiction to CC of $S_F$ (see appendix).

Fig. 10. An MDBS with partially non-conflicting subtransactions.

Fig. 11. FDBS: example.



4 Existing Composite Transaction Models 

In this section we describe several transaction models using the ideas above. In 
particular, we consider distributed transactions, sagas, and federated database 
transactions. These models are two-level models: transactions consist of sub- 
transactions considered as operations at the top-level, and executed as transac- 
tions at the lowest level. Top-level transactions are called global transactions, 
operations in global transactions are global subtransactions and transactions 
circumventing the global layer are called local transactions. 



4.1 Distributed (Multi-database) Transactions 

The model of interest here is often referred to as multidatabase system (MDBS).
Following the usual terminology, $T_i$ ($i = 1, 2, \ldots, n$) are global transactions. Each
global transaction is decomposed into global subtransactions $t_{ij}$, encompassing
all operations to be executed at the component databases $S_j$ ($j = 1, \ldots, m$).
For FS, the set $T_{S_F}$ is the set of global transactions and $O_{S_F}$ is the union of all
subtransactions $t_{ij}$. For the weak and strong input order we set $\to_{S_F} = \mapsto_{S_F} = \emptyset$.

For the bottom schedulers $S_j$, the set of transactions are the subtransactions
$t_{ij}$. $O_{S_j}$ is the set of operations at $S_j$ to be executed for all $t_{ij}$. In the classical
treatment of distributed transactions nothing is known about the commutativity
of subtransactions. This is expressed in the scheduler $S_F$ by assuming a conflict
between every pair of subtransactions: $t_{ij}$ conflicts with $t_{rs}$ iff $i \neq r \land j = s$.
Therefore $S_F$ orders all such pairs the same way: for all $i_1, i_2 \in \{1, 2, \ldots, n\}$,
if $t_{i_1 j} <_{S_F} t_{i_2 j}$ for one $j$, then $t_{i_1 k} <_{S_F} t_{i_2 k}$ for every $k$. Otherwise the
serialisation graph of $S_F$ would be cyclic. Therefore the
weak input order imposed on all component schedulers is the same for all $S_j$.
According to FCC all $S_j$ are CC. From there FCC - in this special MDB setting
- requires that all serialisation orders be the same.
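
The requirement that all component schedulers serialise the global transactions the same way can be checked with one more acyclicity test. This is a sketch under assumptions: the input format (one list per site giving the serialisation order of the global transactions that have a subtransaction there) and all names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """True if the directed graph given as (before, after) pairs has no cycle."""
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def same_serialisation_everywhere(site_orders):
    """site_orders maps each site to the list of global transactions in the
    order in which their subtransactions are serialised at that site.
    The orders are compatible iff their union contains no cycle."""
    edges = set()
    for order in site_orders.values():
        for i, first in enumerate(order):
            for later in order[i + 1:]:
                edges.add((first, later))
    return is_acyclic(edges)


AGREE = same_serialisation_everywhere(
    {"S1": ["T1", "T2"], "S2": ["T1", "T2"]})
DISAGREE = same_serialisation_everywhere(
    {"S1": ["T1", "T2"], "S2": ["T2", "T1"]})
THREE_WAY = same_serialisation_everywhere(
    {"S1": ["T1", "T2"], "S2": ["T2", "T3"], "S3": ["T3", "T1"]})
```

The third case shows why a pairwise comparison of sites is not enough: no two sites contradict each other directly, yet the three orders together form a cycle.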

This result clearly is not surprising and corresponds to the usual understand- 
ing. The advantage of our treatment is that several points become clear. First, 
without further semantic knowledge the technique cannot be improved. Second,
and more importantly, as soon as the $S_F$ scheduler has information about the
commutativity of the subtransactions, we can drop the related weak order
constraint imposed on the corresponding $S_j$ scheduler. Such an example is shown
in figure 10, where there is no conflict between $t_1$ and $t_4$, thus allowing the
serialisation order from $t_4$ to $t_1$.

Now, with respect to recovery, usually we do not assume to have inverses for
global subtransactions. Therefore $S_F$ must defer the commit of every subtransaction
$t_{ij}$ up to the commit of the global transaction $T_i$. This calls for atomic
commitment as it is known. However, we can easily prove that, as soon as there
are inverses known for some $t_{ij}$, the scheduler can commit such $t_{ij}$ early and,
in case of failure, do recovery at the $S_F$ level.



4.2 Sagas 

A saga pj $T_i$ is a logical unit of work and shall be executed as a whole or not
at all. It consists of a (partially) strongly ordered set of (sub-)transactions $t_{ij}$.
Each transaction has an inverse for compensation in case of recovery. With 
respect to concurrency, a saga does allow any kind of interleaving w.r.t. other 
concurrent sagas. 

Let us describe sagas in a two-level stack or fork schedule consisting of a top- 
level schedule and one or more bottom-level schedules. The top-level scheduler 
assumes commutativity of all transaction pairs $(t_{ij}, t_{rs}), i \neq r$. Therefore no
weak output order is generated and imposed as input to the bottom schedulers. 
No constraints are imposed over the serialisation orders of the subtransactions 
executed in Sj. 

Again, this result is not surprising. However, as before we can easily incor- 
porate the fact that not all transactions commute and not all transactions have 
inverses. 
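
The all-or-nothing behaviour of a saga with compensating inverses can be sketched as follows. This is an illustration, not the paper's mechanism: the function names are hypothetical, subtransactions are modelled as plain callables executed one after another, and running the inverses in reverse order of execution is a common convention assumed here.

```python
def run_saga(steps):
    """Run (action, inverse) pairs in order.

    Returns True if every action succeeded. If an action raises, the
    inverses of the already completed actions are executed in reverse
    order (an assumption of this sketch) and False is returned.
    """
    compensations = []
    for action, inverse in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(inverse)
    return True


def make_step(store, key, amount):
    """Hypothetical subtransaction: add `amount` to store[key], with inverse."""
    def action():
        store[key] += amount

    def inverse():
        store[key] -= amount

    return action, inverse


def failing_action():
    """Hypothetical subtransaction that always fails."""
    raise RuntimeError("subtransaction failed")


STOCK = {"x": 0}
COMMITTED = run_saga([make_step(STOCK, "x", 5), make_step(STOCK, "x", 3)])
AFTER_COMMIT = STOCK["x"]
COMPENSATED = run_saga([make_step(STOCK, "x", 7),
                        (failing_action, lambda: None)])
AFTER_COMPENSATION = STOCK["x"]
```

After the failed saga the stock is back at the value left by the committed one, since the completed subtransaction was compensated by its inverse.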

Summing up and comparing with distributed transactions, we make
the following observation: Distributed transactions correspond to a fork schedule
where the transactions of different sagas conflict and no inverse transaction is 
known. Sagas correspond to the case where the transactions of different sagas 
commute and every transaction has an inverse. From here, it is clear that our 
general fork schedule encompasses all cases “in between” these two extremes, 
i.e., if some transactions commute or if some transactions have inverses. 

4.3 Federated Transactions 

Federated transactions are a generalization of MDB transactions in that local
transactions tLk circumvent the top-level scheduler and directly enter the
bottom scheduler SJj (see the corresponding figure). We have modelled this by a
fork schedule with an (artificial) additional schedule SLk at the same level as
SF for every local transaction TLk, which we called an FDBS-schedule
(section 3.4). A correct FDBS is ensured by the FDCC property, as mentioned
earlier. Given this property, we can easily relate special results obtained in
the context of multi-level transactions. There, a “dynamic” conflict relation to
be used by the top-level scheduler was introduced in order to capture indirect
conflicts between local transactions and global subtransactions. This corresponds
to the additional edges of the ghost graph in SF (see its definition).

As a practical method, [12] have proposed to introduce two kinds of commits:
a commit with respect to other global subtransactions at the end of every
subtransaction and a commit with respect to local transactions at the end of 
the global transaction. As a simple method for an implementation with locking, 
it was proposed to have “retained” locks, i.e. locks that are kept until the end 
of the global transaction to shield against local transactions. Such locks are not 
visible to other global subtransactions. The effect of this locking scheme is that 
no ghost order can arise between active transactions. As a result, the ghost order 
represents a temporal relationship reflecting the fact that two transactions were 
executed serially with respect to each other, making cycles impossible. 

The FDCC property also allows a simple explanation of the ticket method 
for federated databases |B|. In this method, all global transactions have at most 
one operation at each local (join) site, hence there is no need for the join to check 
whether a local transaction is serialized between two global subtransactions of 
the same root. Furthermore, a global transaction must read and increase the 
value of a counter at each site. A global commit is done only if there is no 
cycle based on the ticket orders between different global transactions. Because 
there is no input order in traditional schedulers, this method has to rely on the 
ticket values to determine the serialisation order at the local databases. The
assumption that every pair of operations conflicts (see definition E) translates
here into the fact that a counter value has to be updated by each transaction,
forcing all of them to conflict.
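
The ticket idea described above can be sketched as follows. This is an illustrative sketch, not the cited method's implementation: the names Site, ticket_graph and has_cycle are hypothetical, and the per-site ticket values are assumed to be collected in one place before the global commit decision.

```python
from collections import defaultdict
from typing import Dict, Set


class Site:
    """A local site with a ticket counter that every global transaction must update."""

    def __init__(self) -> None:
        self.counter = 0

    def take_ticket(self) -> int:
        # Read and increase the counter; the read value is the ticket.
        value = self.counter
        self.counter = value + 1
        return value


def ticket_graph(tickets: Dict[str, Dict[str, int]]) -> Dict[str, Set[str]]:
    """tickets[site][txn] is the ticket of txn at site.

    An edge a -> b means that a took a smaller ticket than b at some site.
    """
    edges: Dict[str, Set[str]] = defaultdict(set)
    for per_site in tickets.values():
        ordered = sorted(per_site, key=per_site.get)
        for earlier, later in zip(ordered, ordered[1:]):
            edges[earlier].add(later)
    return edges


def has_cycle(edges: Dict[str, Set[str]]) -> bool:
    """Depth-first search for a cycle in the ticket order graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour: Dict[str, int] = defaultdict(int)

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))
```

A global commit would be allowed only when has_cycle(ticket_graph(tickets)) is False, i.e. when the ticket orders of all sites agree on one serialisation order.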



5 Conclusion 

In this paper, we have developed a framework to reason about correctness in 
composite systems. Our main goal has been to allow individual components to 
decide locally while still ensuring global correctness and to determine what in- 
formation needs to be provided to each scheduler in order to do so. We use the 
notion of conflict consistency to restrict correct executions to those consistent 
with the information passed to a scheduler (weak and strong input orders). Based 
on conflict consistency, correctness criteria for several configurations (stack, fork, 
and join) have been developed. These criteria characterize all correct and only 
correct executions. In particular, if the composite system has a tree configura- 
tion (no joins), it suffices to enforce conflict consistency locally to obtain global 
correctness. If joins are involved, however, additional restrictions are necessary 
since the criteria provided are based on non-local information (the ghost graph).
As the example with federated transactions shows, it is still possible to design 
dynamic criteria for configurations containing joins in spite of the static na- 
ture of the criterion for joins. As an additional contribution, we have described 
known transaction mechanisms (distributed transactions, sagas, and federated
transactions) as special cases of the composite framework and from there, we 
have demonstrated how more general mechanisms can be proven correct. 

Future work involves dynamic schedulers including recovery within the com- 
posite framework and exploring more complex configurations, always with the 
trade-off in mind between passing global information and deciding locally. 

6 Appendix

This section contains the proofs to the theorems in the paper. 

Theorem 1. A schedule S is conflict consistent iff the union of its weak input
order and its serialisation graph is acyclic. □

This theorem was proven in earlier work, where we used slightly modified
definitions. The idea is to construct a total order containing the union of the
weak input order and the serialisation graph, and to find a serial schedule with
the same weak output order.
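
The test behind Theorem 1 can be sketched in a few lines. The sketch assumes that both the weak input order and the serialisation graph are given as sets of (before, after) pairs over transaction names; the function names are hypothetical, and the standard-library graphlib module (Python 3.9 or later) is used for the topological sort.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Iterable, List, Optional, Tuple

Edge = Tuple[str, str]  # (before, after)


def total_order(weak_input_order: Iterable[Edge],
                serialisation_graph: Iterable[Edge]) -> Optional[List[str]]:
    """Complete the union of both relations to a total order, or return None on a cycle."""
    sorter: TopologicalSorter = TopologicalSorter()
    for before, after in set(weak_input_order) | set(serialisation_graph):
        # graphlib expects the predecessors of a node.
        sorter.add(after, before)
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


def is_conflict_consistent(weak_input_order: Iterable[Edge],
                           serialisation_graph: Iterable[Edge]) -> bool:
    """Acyclicity test of Theorem 1 on the union of the two relations."""
    return total_order(weak_input_order, serialisation_graph) is not None
```

When the union is acyclic, the returned list is one possible order of the transactions in a serial schedule, which mirrors the construction used in the proof idea.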

Theorem 2. A stack schedule is correct if it is SCC. 

Proof. We construct by induction a serial execution of all transactions of all 
levels, i.e., on all levels all transactions are strongly ordered: 

We can formulate the induction in two steps: 

(1) Whenever a serial schedule Sffi can be constructed which orders conflicting 
operations as <Si-i serialises the transactions in Ti_i as they are ordered 
in <Si the following holds: 

(2) We can construct a serial schedule S[ which orders conflicting operations as 
<Si and serialises the transactions in as they are ordered in <Si+i 

Point (1) is true since: 

0 <Si o' => o^Si-iO' => ~^{o'^Si-iO), as otherwise there would be a cycle in 

U Thus, the transactions of are serialised as <Si- 

Point (2) is true since: 

Si is CC 

=> i^Si U ->-Si) 'Is acyclic 

=> U ~^Si) can be completed to an acyclic total order which defines the 

order of the transactions of a serial schedule S[ which orders conflicting opera- 
tions as <Si- 

As (1) inductively implies (2), what remains is to show the induction hypothesis. 
There is a serial schedule S[ which orders conflicting operations as <s^ since 
all its operations are executed sequentially. Therefore, (2) holds without (1) for 

1 = l. □ 

Theorem 3. An execution in a fork schedule is correct iff it is FCC. 

Proof. For every fork schedule FS a semantically equivalent stack schedule SS 
consisting of SSi and SS 2 can be constructed with: 

SS2 := Sf] Tssi ■= Osp, Ossi ■= {Si=iOsi, ^ssi '■= Ur=i^S'i, := 

ur=i-5., 

CONsSi ■= U"=iC'07Vs. (i.e., Vo G t,o' G t',t G Tsi,t' G yf j : 

^CONssAo,o')), 

<ssi-= ur=i <Si, <-ssi-= ur=i ^Si- 

Now, SS 2 is CC iff Sp is, and SSi is CC iff UILi U — is acyclic. 
Because the fork is FCC then the equivalent stack is SCC. □ 

The locally testable criterion for FCC can be proven by dividing {^Si U ~^Si) 
into subgraphs of the different schedules and considering their connections to 
each other: 
Theorem 4 (Criterion for Fork Conflict Consistency). A fork schedule
FS is fork conflict consistent iff each of the schedules SF, S1, . . . , Sn is
conflict consistent.

Proof. (Only if). From FCC it follows that, for every i ∈ {1, . . . , n}, the
partial graph formed by the union of the weak input order and the serialisation
graph of Si is acyclic: these partial graphs are pairwise unconnected, so each of
them must be acyclic whenever their union is. Hence, by Theorem 1, each one of the
schedules Si is CC.

(If). If each Si is CC, then by Theorem 1 all these partial graphs
(i ∈ {1, . . . , n}) are acyclic.

Since there are no conflicts between operations in different Si, no serialisation
graph edge connects different Si. Hence, all these subgraphs are unconnected, the
union of all those graphs remains acyclic and, thus, FS is FCC. □



Theorem 5. An execution in a join schedule is correct iff it is JCC. 

Proof. (If). Let .<.js be the transitive closure o/ < js U U”=l 

Then, for every JCC join schedule JS a semantically equivalent stack schedule 

SS consisting of SS\ and SS2 can be constructed with: 

SSi := Sj] Oss2 ■= Tsj, Tss2 ■= U"=i^Sii ^SS2 ■= <.ss2-= 

ur=i «s. 

CONSS2 ■= ur=i CONsi u {{0,0') I oG Ot,o' G Of,t G Tsi,t' G Ts,j,i yf j}, 
^SS2 ■= Ur=i ^Si U {{t,t') I ( 3 o e Ot, 3 o e Of ■ CONss 2 {o, o')) A ^{t'\<-.jst), 
such that -^882 is acyclic}. 

Clearly, it is possible to construct an acyclic total order —>-882 “■s H is not contra- 
dicting :<:j8, which is acyclic, and every acyclic partial order can be completed 
to an acyclic total order. Note that SS2 has an output order that directly follows 
from the definition of the new conflicts and its input order. As '^882 cannot 
contradict -^SS2> {^882 0 -^882) is olso acyclic and from SSi is CC follows 
that SS is see. 

(Only if). For a join schedule JS a semantically equivalent stack schedule SS 
consisting of SSi and SS2 is constructed with: 

SSi := Sj; Oss2 ■= Tsj, T882 ■= [Si=iT8i, ^882 '■= ^882 '■= 

ULi-s., 

CON882 ■= U"=i CON8i u {{0,0') I oG Ot,o' G Of,t G T8i,t' G T8,j,i yf j}. 

Be SS see and have <js U Ur=i i^Si 0 ~^Si) « cycle. 

Case ( 1 ): From T<j8T' follows: 

3 t G OT, 3 t' G Ot' ■ t^88it',CONss2{tfl') 

=> t <882 t' ,t->-s8it' , as otherwise SSi not CC 
^ T^S82T' . 

Case (2): From T^8iT' follows: T— 

Case ( 3 ): From T’-^8iT' follows: T^ss^T'. 

Thus, there is a cycle in {^882 0 -^882) "which contradicts SCC. 

Note that in both directions CC of SS\ is equivalent to CC of Sj. □ 



Theorem 6 (Criterion for JCC). A completed join schedule CJS is JCC if
each of the schedules SJ, S1, . . . , Sn is conflict consistent.

Proof. Be G := <js U Ur=i (^Si U 

If all schedules are CC, all subgraphs G|Si (the restriction of G to the nodes
of Si) are acyclic, as <JS-edges exist only between subgraphs. Furthermore, every
edge between subgraphs is a <JS-edge.

Assume now a cycle in G. Then, because of the above reasons, there must be a 
cycle between subgraphs, in general: 

G|si <js ■■■ <js G|s„ <js G|si. 

Then there exist transactions in T$j with: 

t[^t2, t'„^h. 

From definition m and CC follows that there exist weak input orders ^ U+i 
and also 
ti ti 

I.e., there is a cycle G ^2 ^ t„ —>■ ti, which is a contradiction to CC 

ofSj. □ 



Theorem 7 . An execution in an FDBS-schedule is correct iff it is FDCC. 
Proof. (If). Let :<:js be the transitive closure of <fds U Ur=i (^Si ~^Si)- 
Then, for every FDCC join schedule FDS a semantically equivalent stack sched- 
ule SS consisting of SS\ and SS2 can be constructed, whereby SS\ looks like: 

Tssi ■= Ur=i Ossi := {o e Ot \te TssJ, ^ssi ■= U”=i ^Sji, 

CONss, ■■= UtiCONsj,. 

Then SS2 can be defined as: 

Oss2 '■= Ur=i Tss2 ■= {i I 3 o e Oss2 ■ o e Ot}, <-^ss2 ■= 

CONSS2 ■■= CONsf u {{0,0') \ O £ Ot, o' e Ot^,t e Ts,t' e Ts>,S,S' e 
{SF,SLl,...,Su},SffS'}, 

— >SS 2 •= ~^Sf U {{t,t') I ( 3 o G Ot, 3 o G Of : CONss2{o,o')) A ^{t':<:FDst), 
such that -^882 is acyclic}. 

Clearly, it is possible to construct an acyclic total order -^382 it is not contra- 
dicting :<\fd3, which is acyclic, and every acyclic partial order can be completed 
to an acyclic total order. Note that SS2 has an output order that directly follows 
from the definition of the new conflicts and its input order. As ^882 cannot 
contradict ^SS2> {^882 U -^882) is also acyclic and from SSi is CC follows 
that SS is see. 

(Only if). For an FDBS-schedule FDS a semantically equivalent stack schedule 
SS consisting of SS\ and SS2 is constructed with SSi looking like: 

T8S1 ■= U "=1 Tsji, O881 := {o G Ot \ t€ TssJ, ^S3i ■= U ”=1 ^Sji, 

CON33, ■■= [JtiCON 8 j^. 

Then SS2 can be defined as: 

Oss2 ■= Ur=i T332 ■= {i I 3 o G Os 32 ■ o G Ot}, <-^882 ■= ^882 ■= 

^Sf 1 

CON 8 82 •= CON8f U {{0,0') I o G Ot,o' G Of ,t G T8,t' G T8',S,S' G 
{SF,SLl,...,Su},SffS'}. 

Be SS see and have Sff>s U {^Sf U ~^Sf) ® cycle. 

Case (1): From T<fdsT' follows: 

3t G OT,3t' G Ot' ■ t^88it',CONss2{tff') 

=> t <882 t' ,t->-sSit' , as otherwise SSi not CC 
=A T^ss^T'. 
Case (2): From T^SpT' follows: T^ss^T'' ■

Case (3): From follows: T^sSqT' ■ 

Thus, there is a cycle in {^ss^ U -^sSq) "which contradicts SCC. 

Note that in both directions CC of SS\ is equivalent to CC of Sji ... Sjn, as 
every Sji is CC and there is neither an input order between Sji . . . Sjn nor an 
output order because there are no conflicts. □ 



Theorem 8 (Criterion for FDCC). A completed FDBS-schedule CFDS is 
FDCC, if 

(1) each of the schedules SF, SL1 . . . SLl, SJ1 . . . SJn is conflict consistent and

(2) if$t,t' G Ot,T € Tsf ■ {t,t' e Tsjflj e {1 . . .n}, 3tpi ...tpk^ U!=i ^SLi ■ 

■ ■ ■ ^SjjtLk^Sjfl'))- 
Proof. Be G := <fds U Ufci U 

If all schedules are CC, all subgraphs G\si are acyclic as ^FDS~edges exist only 
between subgraphs. Furthermore, every edge between subgraphs is a "^FDS-odge. 
Assume now a cycle in G. Then, because of the above reasons, there must be a 
cycle between subgraphs, in general (let :<:s := U — >s))' 

Tfii .<.Sf ■■■ ■<'Sf ^FDS Till ^FDS ■■■ ^FDS Tilfei ^FDS Tf21 

■<-Sf ■■■ ■<'Sf Tf2j2 ^FDS Tf 21 ^FDS ■■■ ^FDS Tp^k,,, ^FDS Tpil. 

We know that all pairs of transactions in this cycle are different from each other, 
as otherwise the join schedule at which their subtransactions execute would detect 
a contradiction to condition (2) of this criterion. 

Then there exist transactions tpii S Tpii, . . ■ with: 

{tpiji ^Sji tpil ^S.,1 ^Sji tpiki ^Sji tF2l),---, 

{tFm,j„, ^Sj,, tLml ^Sj„ ■ ■ ■ tpil) 

There exist (a.o.) the following conflicts in Sp: 

{tpij,^,tF2l'), . . . , {tpm—l,jm — i 1 tp^al ) , 5 ^-Fll ) ■ 

Conflicting operations are weakly output ordered, and as they are transactions 
in one Sjj, resp., also weakly input ordered. 

=> Tfii '<'Sf ■■■ ■<'Sf Tfiji ^Sf Tf2i---- 

This is a cycle in {^Sf U ->-Sf)^ hence, a contradiction to CC of Sp. Thus, 
must be acyclic and CFDS is FDCC. 



□ 

Abstract. In this paper we consider databases representing information
about moving objects (e.g. vehicles), particularly their location. We
address the problems of updating and querying such databases. Specifically,
the update problem is to determine when the location of a moving
object in the database (namely its database location) should be updated. 
We answer this question by proposing an information cost model that 
captures uncertainty, deviation, and communication. Then we analyze 
dead-reckoning policies, namely policies that update the database loca- 
tion whenever the distance between the actual location and the database 
location exceeds a given threshold, x. Dead-reckoning is the prevalent 
approach in military applications, and our cost model enables us to de- 
termine the threshold x. Then we consider the problem of processing 
range queries in the database, and we propose a probabilistic algorithm 
to solve the problem. 



1 Introduction 

1.1 Background 

Consider a database that represents information about moving objects and their 
location. For example, for a database representing the location of taxi-cabs a 
typical query may be: retrieve the free cabs that are currently within 1 mile of 
33 N. Michigan Ave., Chicago (to pick-up a customer); or for a trucking company 
database a typical query may be: retrieve the trucks that are currently within 1 
mile of truck ABT312 (which needs assistance); or for a database representing 
the current location of objects in a battlefield a typical query may be: retrieve the 
friendly helicopters that are in a given region, or, retrieve the friendly helicopters 
that are expected to enter the region within the next 10 minutes. The queries 
may originate from the moving objects, or from stationary users. We will refer 
to the above applications as MOtion-Database (MOD) applications or moving- 
objects-database applications. In the military, MOD applications arise in the 
context of the digital battlefield (see [II .31 1 2 \ ) . and in the civilian industry they 
arise in transportation systems. 

Currently, MOD applications are being developed in an ad hoc fashion.
Database management system (DBMS) technology provides a potential foundation
for MOD applications; however, DBMSs are currently not used for this
purpose. The reason is that there is a critical set of capabilities that have to be 
integrated, adapted, and built on top of existing DBMS’s in order to support 
moving objects databases. The added capabilities include, among other things, 
support for spatial and temporal information, support for rapidly changing real 
time data, new indexing methods, and imprecision management. The objective 
of our Databases fOr MovINg Objects (DOMINO) project is to build an envelope 
containing these capabilities on top of existing DBMS’s. 

In this paper we address the imprecision problem. The location of a moving 
object is inherently imprecise because, regardless of the policy used to update 
the database location of a moving object (i.e. the object-location stored in the 
database), the database location cannot always be identical to the actual location 
of the object. There may be several location update policies, for example, the 
location is updated every x time units. In this paper we address dead-reckoning 
policies, namely policies that update the database whenever the distance between 
the actual location of a moving object m and its database location exceeds a 
given threshold h, say 1 mile. This means that the DBMS will answer a query
“what is the current location of m?” by an answer A: “the current location
is (x,y) with a deviation of at most 1 mile”. Dead-reckoning is the prevalent
approach in military applications. 

One of the main issues addressed in this paper is how to determine the update 
threshold h in dead-reckoning policies. This threshold determines the location 
imprecision, which encompasses two related but different concepts, namely devi- 
ation and uncertainty. The deviation of a moving object m at a particular point 
in time t is the distance between m’s actual location at time t, and its database 
location at time t. For the answer A above, the deviation is the distance
between the actual location of m and (x,y). On the other hand, the uncertainty
of a moving object m at a particular point in time t is the size of the area in
which the object can possibly be. For the answer A above, the uncertainty is
the area of a circle with radius 1 mile. The deviation has a cost (or penalty) in
terms of incorrect decision making, and so does the uncertainty. The deviation
(resp. uncertainty) cost is proportional to the size of the deviation (resp.
uncertainty). The ratio between the costs of an uncertainty unit and a deviation unit
depends on the interpretation of an answer such as A above, as will be explained 
in section 3. 

In MOD applications the database updates are usually generated by the
moving objects themselves. Each moving object is equipped with a Global
Positioning System (GPS), and it updates its database location using a wireless
network (e.g. ARDIS, RAM Mobile Data Co., IRIDIUM, etc.). This introduces a
third information cost component, namely communication. For example, RAM
Mobile Data Co. charges a minimum of 4 cents per message, with the exact cost
depending on the size of the message. Furthermore, there is a tradeoff between
communication and imprecision in the sense that the higher the communication
cost the lower the imprecision and vice versa. In this paper we propose a model
of the information cost in moving objects databases, which captures imprecision
and communication. The tradeoff is captured in the model by the relative costs
of an uncertainty unit, a deviation unit, and a communication unit.



1.2 Location Update Policies 

Consider an object m moving along a prespecified route. We model the database 
location of m by storing in the database m’s starting time, starting location, and 
a prediction of future locations of the object. In this paper the prediction is given 
as the speed v of the object. Thus the database location of m can be computed 
by the DBMS at any subsequent point in time.^1 This method of modeling the
database location was originally introduced via the concept of a dynamic
attribute; the method is modified here in order to handle uncertainty. The actual 
location of a moving object m deviates from its database location due to the fact 
that m does not travel at the constant speed v. 

A dead-reckoning update policy for m dictates that there is a database-update 
threshold th, i.e. a deviation for which m should send a location/speed update to 
the database. (Note that at any point in time, since m knows its actual location 
and its database location, it can compute its current deviation.) Speed
dead-reckoning^2 (sdr) is a dead-reckoning policy in which the threshold th is fixed for
the duration of the trip. 
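
A minimal simulation sketch of sdr on a one-dimensional route follows. It is an illustration under stated assumptions, not the authors' simulator: positions are route-distances sampled once per time unit, the predicted speed v is kept fixed at every update (the paper also allows the speed to be updated), and the name sdr_update_times is hypothetical.

```python
from typing import List, Sequence


def sdr_update_times(actual_positions: Sequence[float], v: float, th: float) -> List[int]:
    """Return the time points at which a location update is sent under sdr.

    actual_positions[t] is the actual route position at time t; time 0 is the
    start of the trip, where the database location equals the actual location.
    """
    start_location = actual_positions[0]
    start_time = 0
    updates: List[int] = []
    for t, actual in enumerate(actual_positions):
        # Database location: start location plus predicted travel at speed v.
        database_location = start_location + v * (t - start_time)
        deviation = abs(actual - database_location)
        if deviation > th:
            updates.append(t)
            # The update resets the starting point of the prediction.
            start_location, start_time = actual, t
    return updates
```

For the positions [0, 1, 2, 4, 6, 8] with v = 1, a threshold of 1.5 triggers a single update at time 4, whereas a threshold of 0.5 triggers updates at times 3, 4 and 5; this is the communication versus imprecision tradeoff in miniature.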

In this paper we introduce another dead-reckoning update policy, called adaptive
dead reckoning (adr). Adr provides with each update a new threshold th that
is computed using a cost-based approach: th minimizes the total information cost
per time unit until the next update. The total information cost consists of the
update cost, the deviation cost, and the uncertainty cost. In order to minimize 
the total information cost per time unit between now and the next update, the 
moving object m has to estimate when the next update will occur, i.e. when the 
deviation will reach the threshold. Thus, at location update time, in order to 
compute the new threshold, adr predicts the future behavior of the deviation. 
The thresholds differ from update to update because the predicted behavior of 
the deviation is different. 

^1 Our simulation experiments show that, even when the speed fluctuates sharply, this
temporal technique reduces the number of updates to 15% of the number used by
the traditional, nontemporal method in which the database simply stores the latest
known location for each object; this saves 85% of the location-updates overhead.

^2 We use the term speed dead-reckoning to contrast it with the plain dead-reckoning
(pdr) policy in which the database location is fixed until it is explicitly updated by
the moving object; namely, pdr does not use dynamic attributes.

A problem common to both sdr and adr is that the moving object may be
disconnected from the network. In other words, although the DBMS “thinks”
that updates are not generated since the deviation does not exceed the update
threshold, the actual reason is that the moving object is disconnected. To
cope with this problem we introduce a third policy, “disconnection detecting
dead-reckoning (dtdr)”. The policy avoids the regular process of checking for
disconnection by trying to communicate with the moving object, which would
increase the load on the low bandwidth wireless channel. Instead, it uses a novel
technique that decreases the uncertainty threshold for disconnection detection.
Thus, in dtdr the threshold continuously decreases as the time interval since the
last location update increases. It has the value K during the first time unit after
the update, K/2 during the second time unit, K/3 during the third time unit,
etc. Thus, if the object is connected, it is increasingly likely that it will
generate an update. Conversely, if the moving object does not generate an update,
as the time interval since the last update increases it is increasingly likely
that the moving object is disconnected. The dtdr policy computes the K that
minimizes the total information cost, i.e. the sum of the update cost, the
deviation cost, and the uncertainty cost.
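
The decreasing threshold of dtdr can be written as a small function. This is a sketch only: the name dtdr_threshold is hypothetical, and the convention that the n-th time unit is the half-open interval [n - 1, n) is an assumption made here for concreteness.

```python
import math


def dtdr_threshold(K: float, elapsed: float) -> float:
    """Threshold in effect `elapsed` time units after the last location update.

    The value is K during the first time unit, K/2 during the second,
    K/3 during the third, and so on.
    """
    if elapsed < 0:
        raise ValueError("elapsed time must be nonnegative")
    time_unit = math.floor(elapsed) + 1  # 1 on [0, 1), 2 on [1, 2), ...
    return K / time_unit
```

With K = 6, the threshold is 6 half a time unit after the update, 3 after 1.2 time units, and 2 after 2.9 time units, so a connected object becomes more and more likely to report.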

To contrast the three policies, observe that for sdr the threshold is fixed for all 
location updates. For adr the threshold is fixed between each pair of consecutive 
updates, but it may change from pair to pair. For dtdr the threshold decreases 
as the period of time between a pair of consecutive updates increases. 

We compared by simulation the three policies introduced in this paper. Our 
simulations indicate that adr is superior to sdr in the sense that it has a lower or 
equal information cost for every value of the update-unit cost, uncertainty-unit 
cost, and deviation-unit cost. Adr is superior to dtdr in the same sense; the dif- 
ference between the costs of the two policies quantifies the cost of disconnection 
detection. For some parameter combinations the information cost of sdr is six
times as high as that of adr. 

Finally, an additional contribution of this paper is a probabilistic model and 
an algorithm for query processing in motion databases. In our model the location 
of the moving object is a random variable, and at any point in time the database 
location and the uncertainty are used to determine a density function for this 
variable. Based on this model we developed an algorithm that processes range 
queries such as Q=‘retrieve the moving objects that are currently inside a given 
region R’. The answer to Q is a set of objects, each of which is associated with 
the probability that currently the object is inside R. 
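
To make the shape of such an answer concrete, the following sketch processes a range query on a one-dimensional route. It is not the algorithm of section 6: it assumes, purely for illustration, a uniform density over the interval of half-width uncertainty around the database location, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]


def inside_probability(database_location: float, uncertainty: float,
                       region: Interval) -> float:
    """Probability that the object lies in `region`, under a uniform density
    on [database_location - uncertainty, database_location + uncertainty]."""
    low, high = region
    if uncertainty == 0:
        return 1.0 if low <= database_location <= high else 0.0
    left = database_location - uncertainty
    right = database_location + uncertainty
    overlap = max(0.0, min(high, right) - max(low, left))
    return overlap / (right - left)


def range_query(objects: Dict[str, Tuple[float, float]],
                region: Interval) -> Dict[str, float]:
    """Map each object that may be inside `region` to its probability.

    objects[name] = (database_location, uncertainty).
    """
    answer: Dict[str, float] = {}
    for name, (database_location, uncertainty) in objects.items():
        p = inside_probability(database_location, uncertainty, region)
        if p > 0:
            answer[name] = p
    return answer
```

For example, an object with database location 10 and uncertainty 2 is inside the region (9, 12) with probability 0.75 under this assumed density, while an object at 20 with uncertainty 1 is not returned at all.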

The rest of this paper is organized as follows. In section 2 we introduce the 
data model and discuss location attributes of moving objects. In section 3 we 
discuss the information cost of a trip, and in section 4 we introduce our approach 
to cost optimization. In section 5 we describe the three location update policies. 
In section 6 we present our approach to probabilistic query processing. In section 
7 we discuss relevant work, and in the last section we summarize our results. 



2 The Data Model 



In this section we define the main concepts used in this paper. A database is a set 
of object-classes. An object-class is a set of attributes. Some object-classes are 
designated as spatial. Each spatial object class is either a point-class, a line-class,
or a polygon-class in two-dimensional space (all our concepts and results can be
extended to three-dimensional space). 

Point object classes are either mobile or stationary. A point object class O 
has a location attribute L. If the object class is stationary, its location attribute 
has two sub-attributes L.x, and L.y, representing the x and y coordinates of the 
object. If the object class is mobile, its location attribute has six sub-attributes, 
L.route, L.startlocation, L.starttime, L.direction, L.speed, and L.uncertainty.

The semantics of the sub-attributes are as follows. L.route is (the pointer 
to) a line spatial object indicating the route on which an object in the class O 
is moving. Although we assume that the objects move along predefined routes, 
our results can be extended to free movement in space (e.g. by aircraft). We will 
comment on that option in the last paragraph of this section. L.startlocation is
a point on L.route; it is the location of the moving object at time L.starttime.
In other words, L.starttime is the time when the moving object was at location
L.startlocation. We assume that whenever a moving object updates its L
attribute it updates the L.startlocation subattribute; thus at any point in time
L.starttime is also the time of the last location-update. We assume in this paper
that the database updates are instantaneous, i.e. valid- and transaction-times
are equal. Therefore, L.starttime is the time at which the update
occurred in the real world system being modeled, and also the time when the
database installs the update. L.direction is a binary indicator having a value 0 or
1 (these values may correspond to north-south, or east-west, or the two endpoints
of the route). L.speed is a function that represents the predicted future locations
of the object. It gives the distance of the moving object from L.startlocation as
a function of the number t of time units elapsed since the last location-update,
namely since L.starttime. The function has the value 0 when t = 0. In its simplest
form (which is the only form we consider in this extended abstract) L.speed
represents a constant speed v, i.e. the distance is v · t.^3 L.uncertainty is either a
constant, or a function of the number t of time units elapsed since L.starttime.
It represents the threshold on the location deviation (the deviation is formally
defined at the end of this section); when the deviation reaches the threshold, the
moving object sends a location update message. Observe that the uncertainty
may change automatically as the time elapsed since L.starttime increases; this
is indeed the case for the dtdr policy.

^3 Another possibility for representing future locations is a sequence of speeds, i.e., the
object will move at speed v1 until time t1, at speed v2 until time t2, etc. Such a future
plan is typical of, for example, a vehicle that expects various traffic conditions; or a
package that first travels by truck, then by plane, then waits (speed 0) for another
truck loading, etc.

We define the route-distance between two points on a given route to be the
distance along the route between the two points. We assume that it is straightforward
to compute the route-distance between two points, and the point at a given
route-distance from another point. The database location of a moving object at
a given point in time is defined as follows. At time L.starttime the database
location is L.startlocation; the database location at time L.starttime + t is the
point (x,y) which is at route-distance L.speed · t from the point L.startlocation.
Intuitively, the database location of a moving object m at a given time point t is
the location of m as far as the DBMS knows; it is the location that is returned by
the DBMS in response to a query entered at time t that retrieves m’s location.
Such a query also returns the uncertainty at time t, i.e. it returns an answer of
the form: m is on L.route at most L.uncertainty ahead of or behind (x,y).
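
The computation of the database location can be sketched for a route stored as a polyline. The sketch makes several assumptions that are not in the paper: the route is a list of vertices, L.startlocation is represented by its route-distance start_offset from the first vertex, movement is in the direction of increasing route-distance (L.direction is ignored), and positions beyond the route ends are clamped. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_at_route_distance(route: List[Point], distance: float) -> Point:
    """Point at the given route-distance from the first vertex of the polyline."""
    if distance <= 0:
        return route[0]
    for (x1, y1), (x2, y2) in zip(route, route[1:]):
        segment = math.hypot(x2 - x1, y2 - y1)
        if segment > 0 and distance <= segment:
            fraction = distance / segment
            return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))
        distance -= segment
    return route[-1]  # clamp to the end of the route


def database_location(route: List[Point], start_offset: float, speed: float,
                      starttime: float, now: float) -> Point:
    """Database location at time `now`: route-distance speed * t from the start
    location, where t is the time elapsed since starttime."""
    elapsed = now - starttime
    return point_at_route_distance(route, start_offset + speed * elapsed)
```

For the route [(0, 0), (3, 0), (3, 4)], a start offset of 1, a speed of 2 and two elapsed time units, the object is predicted at route-distance 5, i.e. at the point (3, 2).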

Since between two consecutive location updates the moving object does not
travel at exactly the speed L.speed, the actual location of the moving object
deviates from its database location. Formally, for a moving object, the deviation
d at a point in time t, denoted d(t), is the route-distance between the moving
object’s actual location at time t and its database location at time t. The
deviation is always nonnegative. At any point in time the moving object knows its
current location, and it also knows all the subattributes of its location attribute.
Therefore at any point in time the (computer onboard the) moving object can
compute the current deviation. Observe that at time L.starttime the deviation
is zero.

At the beginning of the trip the moving object updates all the sub-attributes 
of its location attribute. Subsequently, the moving object periodically updates 
its current location and speed stored in the database. Specifically, a location 
update is a message sent by the moving object to the database to update some 
or all the sub-attributes of its location attribute. The moving object sends the 
location update when the deviation exceeds the L.uncertainty threshold, or
when the moving object changes route or direction. The location update message
contains at least the values for L.speed and L.startlocation. Obviously, other
subattributes can also be updated. The subattribute L.starttime is written by
the DBMS whenever it installs a location update; it denotes the time when the 
installation is done. 

Before concluding this section we would like to point out that the results 
of this paper hold for free-movement modeling, i.e. for objects that move freely 
in space (e.g. aircraft) rather than on routes. In this case L.route is an infinite 
straight line (e.g. 60 degrees from the starting point) rather than a line-object 
stored in the database. Then there are two possibilities of modeling the uncer- 
tainty. The first is identical to the one described above, i.e. the uncertainty is 
a segment on the infinite line representing the route. In this case every change 
of direction constitutes a change of route, thus necessitating a location update. 
The second possibility is to redefine the deviation to be the Euclidean distance 
between the database location and the actual location, and to remove the re- 
quirement that the object updates the database whenever it changes routes. 
In this case L.uncertainty defines a circle around the database location, and
a query that retrieves the location of a moving object m returns an answer of
the form: m is within a circle having a radius of at most L.uncertainty from
(x,y). Observe that the second possibility of modeling uncertainty necessitates
fewer location updates, but the answer to a query is less informative since the
uncertainty is given in two-dimensional space rather than one-dimensional space.

3 The Information Cost of a Trip

In this section we define the information cost model for a trip taken by a moving 
object m, and we discuss information cost optimality. 

At each point in time during the trip the moving object has a deviation and 
an uncertainty, each of which carries a penalty. Additionally the moving object 
sends location update messages. Thus the information cost of a trip consists of 
the cost of deviation, cost of communication, and cost of uncertainty. 

Now we define the deviation cost. Observe first that the cost of the deviation 
depends both on the size of the deviation and on the length of time for which it 
persists. It depends on the size of the deviation since decision-making is clearly 
affected by it. To see that it depends on the length of time for which the deviation 
persists, suppose that there is one query per time unit that retrieves the location 
of a moving object m. Then, if the deviation persists for two time units its 
cost will be twice the cost of the deviation that persists for a single time unit; 
the reason is that two queries (instead of one) will pay the deviation penalty. 
Formally, for a moving object m the cost of the deviation between two time
points $t_1$ and $t_2$ is given by the deviation cost function, denoted
$COST_d(t_1, t_2)$; it is a function of two variables that maps the deviation
between the time points $t_1$ and $t_2$ into a nonnegative number. In this paper
we take the penalty for each unit of deviation during a unit of time to be one (1).
Then, the cost of the deviation between two time points $t_1$ and $t_2$ is:

$$COST_d(t_1, t_2) = \int_{t_1}^{t_2} d(t)\,dt \qquad (1)$$

The update cost, denoted $C_1$, is a nonnegative number representing the cost
of a location-update message sent from the moving object to the database. This
is the cost of the resources (i.e. bandwidth and computation) consumed by the
update. The update cost may differ from one moving object to another, and
it may vary even for a single moving object during a trip, due, for example, to
changing availability of bandwidth. The update cost must be given in the same
units as the deviation cost. In particular, if the update cost is $C_1$ it means the
ratio between the update cost and the cost of a unit of deviation per unit of
time (which is one) is $C_1$. It also means that the moving object (or the system)
is willing to use $1/C_1$ messages in order to reduce the deviation by one during
one unit of time.

Now we define the uncertainty cost. Observe that, as for the deviation, the
cost of the uncertainty depends both on the size of the uncertainty and on the
length of time for which it persists. Formally, for a moving object m the cost of
the uncertainty between two time points $t_1$ and $t_2$ is given by the uncertainty
cost function, denoted $COST_u(t_1, t_2)$; it is a function of two variables that maps
the uncertainty between the time points $t_1$ and $t_2$ into a nonnegative number.
Define the uncertainty unit cost to be the penalty for each unit of uncertainty
during a unit of time, and denote it by $C_2$. Then, the cost of the uncertainty of
m between two time points $t_1$ and $t_2$ is:

$$COST_u(t_1, t_2) = \int_{t_1}^{t_2} C_2\, u(t)\,dt \qquad (2)$$

where u(t) is the value of the L.uncertainty subattribute as a function of time.

The uncertainty unit cost $C_2$ is the ratio between the cost of a unit of
uncertainty and the cost of a unit of deviation. Consider an answer returned by the
DBMS: "the current location of the moving object m is (x,y), with a deviation
of at most u units". $C_2$ should be set higher than 1 if the uncertainty in such
an answer is more important than the deviation, and lower than 1 otherwise.
Observe that in a dead-reckoning update policy each update message establishes 
a new uncertainty which is not necessarily lower than the previous one. Thus 
communication reduces the deviation but not necessarily the uncertainty. 

Now we are ready to define the information cost of a trip taken by a moving
object m. Let $t_1$ and $t_2$ be the time-stamps of two consecutive location update
messages. Then the information cost in the interval $[t_1, t_2)$ is:

$$COST_I[t_1, t_2) = C_1 + COST_d(t_1, t_2) + COST_u(t_1, t_2) \qquad (3)$$

Observe that $COST_I[t_1, t_2)$ includes the message cost at time $t_1$ but not
the cost of the one at time $t_2$. Observe also that each location update message
writes the actual current location of m in the database, thus it reduces the
deviation to zero. The total information cost of a trip is computed by summing
up $COST_I[t_1, t_2)$ for every pair of consecutive update points $t_1$ and $t_2$. Formally,
let the time points of the update messages sent by m be $t_1, t_2, \ldots, t_k$. Furthermore,
let 0 be the time point when the trip started and $t_{k+1}$ the time point when the
trip ended. Then the total information cost of a trip is

$$COST_I = COST_d(0, t_1) + COST_u(0, t_1) + \sum_{i=1}^{k} COST_I[t_i, t_{i+1}) \qquad (4)$$
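As a rough illustration of equations (1) through (4), the following Python sketch computes the information cost of a trip from sampled values. It assumes, for illustration only, that the deviation d(t) and the uncertainty u(t) are available as equally spaced samples within each interval between updates and that the integrals are approximated by the trapezoidal rule; the function names are hypothetical.

```python
def integrate(samples, dt):
    """Trapezoidal approximation of an integral from equally spaced samples."""
    return sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))


def interval_cost(dev_samples, unc_samples, dt, c1, c2, with_message=True):
    """Information cost of one interval between consecutive updates."""
    cost_d = integrate(dev_samples, dt)        # equation (1)
    cost_u = c2 * integrate(unc_samples, dt)   # equation (2)
    message = c1 if with_message else 0.0      # C_1 is charged at the interval start
    return message + cost_d + cost_u           # equation (3)


def trip_cost(intervals, dt, c1, c2):
    """Total cost of a trip, following equation (4).

    intervals[0] covers [0, t_1) and is charged without a message cost;
    every later interval [t_i, t_{i+1}) starts with an update message.
    Each entry is a pair (deviation samples, uncertainty samples).
    """
    total = 0.0
    for index, (dev, unc) in enumerate(intervals):
        total += interval_cost(dev, unc, dt, c1, c2, with_message=(index > 0))
    return total
```

For example, two intervals of two time units each, with the deviation growing linearly from 0 to 2 and the uncertainty fixed at 2, cost 4 and 9 respectively when $C_1 = 5$ and $C_2 = 0.5$.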

4 Cost Based Optimization for Dead Reckoning Policies 

As mentioned in the introduction, a dead-reckoning update policy for a moving 
object m dictates that at any point in time there is a database-update threshold 
th, of which both the DBMS and m are aware. When the deviation of m reaches 
th, m sends to the database an update consisting of the current location, the 
predicted speed, and the new deviation threshold K. The objective of the dead 
reckoning policies that we introduce in this paper is to set K (which the DBMS 
installs in the L .uncertainty subattribute), such that the total information cost 
is minimized. Intuitively, this is done as follows. First, m predicts the future
behavior of the deviation. Based on this prediction, the average cost per
time unit between now and the next update is obtained as a function f of the
new threshold K. Then K is set to minimize f. It is important to observe that
we optimize the average cost per time unit rather than simply the total cost
between the two time points; clearly, the total cost increases as the time interval
until the next update increases.

Footnote: Let us observe that the proposed method of optimizing the new threshold K is
not unique. We have devised other methods which are omitted from this extended
abstract. A performance comparison among these methods is the subject of future
work.

The next theorem establishes the optimal value K for L.uncertainty under
the assumption that the deviation between two consecutive updates is a linear
function of time.

Theorem 1: Denote the update cost by $C_1$, and the uncertainty unit cost by
$C_2$. Assume that for a moving object two consecutive location updates occur at
time points $t_1$ and $t_2$. Assume further that between $t_1$ and $t_2$, the deviation d(t) is
given by the function $a(t - t_1)$ where $t_1 < t < t_2$ and a is some positive constant;
and L.uncertainty is fixed at K throughout the interval $(t_1, t_2)$. Then the total
information cost per time unit between $t_1$ and $t_2$ is minimized if

$$K = \sqrt{\frac{2 a C_1}{2 C_2 + 1}}$$ □

The implication of theorem 1 is the following. Suppose that a moving object
m is currently at time point $t_1$, i.e. its deviation has reached the uncertainty
threshold L.uncertainty. Now m needs to compute a new value for
L.uncertainty and send it in the location update message. Suppose further that
m predicts that following the update the deviation will behave as the linear
function $a(t - t_1)$, and in the update message it has to set the uncertainty
threshold L.uncertainty to a value that will remain fixed until the next update.
Then, in order to optimize the information cost, m should set the threshold to

$$K = \sqrt{\frac{2 a C_1}{2 C_2 + 1}}$$
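The value $K = \sqrt{2aC_1/(2C_2+1)}$ can be checked directly from the cost model. The derivation below is a sketch, not the proof from the paper; it assumes that the next update is triggered exactly when the deviation reaches K, so that the interval length $T = t_2 - t_1$ satisfies $aT = K$.

```latex
\begin{align*}
T &= t_2 - t_1 = \frac{K}{a}, \\
COST_I[t_1, t_2) &= C_1 + \int_0^{T} a s \, ds + C_2 K T
                  = C_1 + \frac{K^2}{2a} + \frac{C_2 K^2}{a}, \\
\frac{COST_I[t_1, t_2)}{T} &= \frac{a C_1}{K} + \frac{(2 C_2 + 1) K}{2}, \\
\frac{d}{dK}\left( \frac{a C_1}{K} + \frac{(2 C_2 + 1) K}{2} \right)
  &= -\frac{a C_1}{K^2} + \frac{2 C_2 + 1}{2} = 0
  \iff K = \sqrt{\frac{2 a C_1}{2 C_2 + 1}}.
\end{align*}
```

The second derivative $2aC_1/K^3$ is positive for $K > 0$, so this stationary point is a minimum of the cost per time unit.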



Next assume that, in order to detect disconnection, one is interested in a
dead-reckoning policy in which the uncertainty threshold L.uncertainty
continuously decreases between updates. In particular, we consider one type of
decrease, which we call fractional decrease; other types exist, but we found this
one convenient. Let K be a constant. If the uncertainty threshold L.uncertainty
decreases fractionally starting with K, then during the first time unit after a
location update u its value is K, during the second time unit after u its value
is K/2, during the third time unit after u its value is K/3, etc., until the next
update (which establishes a new K).

Theorem 2: Assume that for a moving object two consecutive location
updates occur at time points $t_1$ and $t_2$. Assume further that between $t_1$ and
$t_2$, the deviation d(t) is given by the function $a(t - t_1)$ where $t_1 < t < t_2$
and a is some positive constant; and in the time interval $(t_1, t_2)$ L.uncertainty
decreases fractionally starting with a constant K. Then the total information
cost per time unit between $t_1$ and $t_2$ is given by the following function of K.

$$f(K) = \frac{C_1 + \frac{K}{2} + C_2 K \left(1 + \frac{1}{2} + \cdots + \frac{1}{\sqrt{K/a}}\right)}{\sqrt{K/a}}$$ □



Similarly to theorem 1, the implication of theorem 2 is the following. Suppose
that a moving object is currently at time point $t_1$, i.e. it is about to send a
location update message, and it can predict that following the update the deviation
will behave as the linear function $a(t - t_1)$, and in the update message it sets the
uncertainty threshold L.uncertainty to a fractionally decreasing value starting
with K. Then in order to optimize the information cost it should set K to the
value that minimizes the function of theorem 2.



5 The Location Update Policies and Their Performance 

In this section we describe and motivate three location update policies. Then we 
report on their comparison by simulation. 

The speed dead-reckoning (sdr) policy. At the beginning of the trip the
moving object m sends to the DBMS an uncertainty threshold that is selected
in an ad hoc fashion; it is stored in L.uncertainty, and it remains fixed for the
duration of the trip. The object m updates the database whenever the deviation
exceeds L.uncertainty; the update simply includes the current location and
current speed. □

The adaptive dead reckoning (adr) policy. At the beginning of the
trip the moving object m sends to the DBMS an initial deviation threshold $th_1$
selected arbitrarily. Then m starts tracking the deviation. When the deviation
reaches $th_1$, the moving object sends an update message to the database. The
update consists of the current speed, current location, and a new threshold $th_2$
that the DBMS should install in the L.uncertainty subattribute. $th_2$ is computed
as follows. Denote by $t_1$ the number of time units from the beginning of the trip
until the deviation reaches $th_1$ for the first time, by $I_1$ the cost of the deviation
(which is computed using equation (1)) during the same time interval, and let
$a_1 = \frac{2 I_1}{t_1^2}$. Then $th_2$ is $\sqrt{\frac{2 a_1 C_1}{2 C_2 + 1}}$ (remember, $C_1$ is the update cost, $C_2$ is the
unit-uncertainty cost). When the deviation reaches $th_2$, a similar update is sent,
except that the new threshold $th_3$ is $\sqrt{\frac{2 a_2 C_1}{2 C_2 + 1}}$, where $a_2 = \frac{2 I_2}{t_2^2}$ ($I_2$ is the cost of
the deviation from the first update to the second update, $t_2$ is the number of time
units elapsed since the first location update). Since $a_2$ may be different from $a_1$,
$th_2$ may be different from $th_3$. When $th_3$ is reached the object will send another
update containing $th_4$ (which is computed in a similar fashion), and so on. □
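The threshold computation of adr is small enough to sketch in Python. The sketch assumes the slope estimate $a = 2I/t^2$ (the slope of the linear function whose deviation cost over t time units equals I) and the optimal threshold of theorem 1, $\sqrt{2aC_1/(2C_2+1)}$; the function names are hypothetical.

```python
import math


def estimated_slope(deviation_cost, elapsed):
    """Slope a of the linear deviation a*t whose cost over `elapsed` equals deviation_cost.

    The integral of a*t from 0 to T is a*T**2/2, hence a = 2*I / T**2.
    """
    return 2.0 * deviation_cost / elapsed ** 2


def adr_threshold(deviation_cost, elapsed, c1, c2):
    """Next threshold under adr: the theorem 1 optimum for the estimated slope."""
    a = estimated_slope(deviation_cost, elapsed)
    return math.sqrt(2.0 * a * c1 / (2.0 * c2 + 1.0))
```

At each update the moving object would call `adr_threshold` with the deviation cost accumulated since the previous update and the number of time units elapsed since then.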

The mathematical motivation for adr is based on theorem 1 in a
straightforward way. Namely, at each update time point $p_i$ adr simply sets the next
threshold in a way that optimizes the information cost per time unit (according
to theorem 1), assuming that the deviation following time $p_i$ will behave as the
following linear function: $d(t) = \frac{2 I_i}{t_i^2}\, t$, where t is the number of time units after
$p_i$, $t_i$ is the number of time units between the immediately preceding update
and the current one (at time $p_i$), and $I_i$ is the cost of the deviation during the same
time interval. The reason for this prediction of the future deviation is as follows.
Adr approximates the current deviation, i.e. the deviation from the time of the
immediately preceding update to time $p_i$, by a linear function (see [...]) with
slope $\frac{2 I_i}{t_i^2}$. Observe that at time $p_i$ this linear function has the same deviation
cost (namely $I_i$) as the actual current deviation. Based on the locality principle,
adr predicts that after the update at time $p_i$, the deviation will behave according
to the same approximation function.

Footnote: Sdr can also use another speed, for example, the average speed since the last update,
or the average speed since the beginning of the trip, or a speed that is predicted based
on knowledge of the terrain. This comment holds for the other policies discussed in
this section.

The disconnection detection dead reckoning (dtdr) policy. At the
beginning of the trip the moving object m sends to the DBMS an initial deviation
threshold $th_1$ selected arbitrarily. The moving object sets the uncertainty
threshold L.uncertainty to a fractionally decreasing value starting with $th_1$.
That is, during the first time unit the uncertainty threshold is $th_1$, during the
second time unit it is $th_1/2$, and so on. Then it starts tracking the deviation.
At time $t_1$, when the deviation reaches the current uncertainty threshold, namely
$th_1/t_1$, the moving object sends a location update message to the database. The
update consists of the current speed, current location, and a new threshold $th_2$
to be installed in the L.uncertainty subattribute.

$th_2$ is computed using the function f(K) of theorem 2. Since f(K) uses
the slope a of the future deviation, we first estimate the future deviation as
in the adr case, as follows. Denote by $I_1$ the cost of the deviation (which is
computed using equation (1)) since the beginning of the trip, and let $a_1 = \frac{2 I_1}{t_1^2}$.
Now, observe that f(K) does not have a closed form formula. Thus we first
approximate the sum $\frac{1}{2} + \cdots + \frac{1}{\sqrt{K/a_1}}$ by $\ln(\sqrt{K/a_1})$ (since $\ln n$ is an approximation
for the nth harmonic number). Thus the approximation function for f(K) is

$$g(K) = \frac{C_1 + \frac{K}{2} + C_2 K \left(\ln\left(\sqrt{K/a_1}\right) + 1\right)}{\sqrt{K/a_1}}$$

The derivative of g(K) is zero when K is the solution to the following equation.

$$\ln(K) = \frac{d_1}{K} - d_2 \qquad (5)$$

where in the equation, $d_1 = \frac{2 C_1}{C_2}$ and $d_2 = \frac{1}{C_2} + 4 - \ln(a_1)$. We find a numerical
solution to this equation using the Newton-Raphson method. The solution is
the new threshold $th_2$, and the moving object sets the uncertainty threshold
L.uncertainty to a fractionally decreasing value starting with $th_2$.
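A Newton-Raphson solver for equation (5) can be sketched as follows. The sketch rests on assumptions that should be checked against the original typeset paper: the constants are taken to be $d_1 = 2C_1/C_2$ and $d_2 = 1/C_2 + 4 - \ln(a)$, the starting point `k0`, the tolerance, and the safeguard that halves K when a Newton step would leave the positive axis are choices made here for illustration, and the function name is hypothetical.

```python
import math


def dtdr_threshold(c1, c2, a, k0=1.0, tol=1e-10, max_iter=100):
    """Solve ln(K) = d1/K - d2 for K > 0 by Newton-Raphson.

    h(K) = ln(K) - d1/K + d2 is strictly increasing on K > 0 when d1 > 0,
    so it has exactly one positive root.
    """
    d1 = 2.0 * c1 / c2                      # assumed form of d_1
    d2 = 1.0 / c2 + 4.0 - math.log(a)       # assumed form of d_2
    k = k0
    for _ in range(max_iter):
        h = math.log(k) - d1 / k + d2
        dh = 1.0 / k + d1 / k ** 2          # derivative of h, positive for K > 0
        k_next = k - h / dh
        if k_next <= 0.0:
            # Newton overshot past zero; fall back to halving K.
            k_next = k / 2.0
        if abs(k_next - k) < tol:
            return k_next
        k = k_next
    raise RuntimeError("Newton-Raphson did not converge")
```

Because h is increasing and concave, iterates that start to the left of the root approach it monotonically; the halving step only matters when the starting point lies far to the right of the root.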

After $t_2$ time units, when the deviation reaches the current uncertainty
threshold, namely $th_2/t_2$, a location update containing $th_3$ is sent. $th_3$ is
computed as above, except that a new slope $a_2$ (as in adr) is used; and $I_2$ is the
cost of the deviation during the previous $t_2$ time units. The process continues
until the end of the trip. That is, at each update time point, dtdr determines the
next optimal threshold by the constants $C_1$, $C_2$, and the slope of the current
deviation approximation function. □

Footnote: More powerful, nonlinear approximation functions have been considered, but their
discussion is omitted from this extended abstract.

Footnote: In this sense we are using a simple linear regression, but instead of the common least
squares method we employ an equal sums method that is more appropriate for our
cost function.

We compared by simulation the three policies introduced in this paper, namely
adr, dtdr, and sdr. The parameters of the simulation are the following: the
update-unit cost, namely the cost of a location-update message; the
uncertainty-unit cost, namely the cost of a unit of uncertainty; the deviation-unit cost, namely
the cost of a unit of deviation; and a speed curve, namely a function that for a
period of time gives the speed of the moving object at any point in time. The
comparison is done by quantifying the total information cost of each policy for
a large number of combinations of the parameters. For space considerations we
omit the detailed results of the simulations. The main conclusions are: 1. adr is
superior to sdr in the sense that it has a lower or equal information cost for every
value of the update-unit cost, uncertainty-unit cost, and deviation-unit cost; for
some parameter combinations the information cost of sdr is six times as high as
that of adr. 2. adr is superior to dtdr in the same sense; the difference between
the costs of the two policies quantifies the cost of disconnection detection.

6 Querying with Uncertainty 

In this section we present a probabilistic method for specifying and processing 
range queries about motion databases. For example, a typical query might be
"Retrieve all objects o which are within the region R". Since there is an
uncertainty about the location of the various objects at any time, we may not be able
to answer the above query with absolute certainty. Instead, our query processing
algorithm outputs a set of pairs of the form (o,p) where o is an object and p is
the probability that the object is in region R at time t; in fact, the algorithm
retrieves only those pairs for which p is greater than some minimum value. Note
that here we are using probability as a measure of certainty.

As indicated, we assume that all the objects are traveling on routes. Since the
actual location is not exactly known, we assume that the location of an object
o on its route at time t is a random variable $P_o$. We let $f_o(x)$ denote the density
function of this random variable. More specifically, for small values of dx, $f_o(x)dx$
denotes the probability that o is at some point in the interval [x, x + dx] at time t
(actually, $f_o$ is a function of x and t; however we omit this as t is understood from
the context). The mean $m_o$ of the above random variable is given by the database
location of o (this equals o.L.startlocation + o.L.speed · (t − o.L.starttime); see
section 2).

Now we discuss some possible candidates for the density functions $f_o$. Many
natural processes tend to behave according to the normal density function. Let
$N_{m,\sigma}(x)$ denote a normal density function with mean m and standard deviation
$\sigma$. We can adopt the normal density function as follows. We take the mean m
to be equal to $m_o$ given in the previous paragraph. Next we relate the standard
deviation to the uncertainty of the object location. We do this by setting
$\sigma = \frac{1}{c}$(o.L.uncertainty) where c > 0 is a constant. In this case, the probability
that the object is within a distance of o.L.uncertainty (i.e. within a distance
of $c\sigma$) from the location $m_o$ will be higher for higher values of c; for example,
this probability will be equal to .68, .95 and .997 for values of c equal to 1, 2 and
3 respectively (see [3]). The value of c is a function of the update policy used,
the reliability of the network, the time since the last update and the ratio
between uncertainty and deviation unit costs. Whatever may be the value of c, there
is still a non-zero probability p that the object is at a distance greater than
o.L.uncertainty from the mean $m_o$; we can interpret this probability to be
the probability that there is a disconnection. An alternative is to make p zero.
This can be done by modifying the normal distribution to be a bounded normal
distribution. This is done by conditioning that the object is within distance
o.L.uncertainty from $m_o$. More specifically, we first use the normal distribution
as above by choosing an appropriate c. Then we compute the probability
q that the object is within a distance of o.L.uncertainty from $m_o$. Define the
density function $f_o(x)$ to be equal to $\frac{1}{q} N_{m_o,\sigma}(x)$ for values of x within the
interval [($m_o$ − o.L.uncertainty), ($m_o$ + o.L.uncertainty)] and to be zero for other
values of x.
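The bounded normal density can be written down with the Python standard library alone. The sketch below is illustrative: the function names are hypothetical, and q is computed with `math.erf`, using the fact that a normal variable lies within c standard deviations of its mean with probability erf(c/√2).

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of the normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def coverage(c):
    """Probability q that a normal variable is within c standard deviations of its mean."""
    return math.erf(c / math.sqrt(2.0))


def bounded_normal_pdf(x, mean, uncertainty, c):
    """Normal density conditioned on |x - mean| <= uncertainty, with sigma = uncertainty / c."""
    if abs(x - mean) > uncertainty:
        return 0.0
    sigma = uncertainty / c
    return normal_pdf(x, mean, sigma) / coverage(c)
```

Rounding `coverage(1)`, `coverage(2)` and `coverage(3)` reproduces the values .68, .95 and .997 quoted in the text.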

A range query on motion databases is of the following form: 

RETRIEVE o FROM Moving-objects WHERE C. 

Here the condition part C is a disjunction of clauses where each clause is a
conjunction of two conditions $C_1$ and $C_2$; $C_1$ only refers to the static attributes
of the objects (such as 'type', 'color' etc.) and is called the static part; $C_2$ depends
upon the location attributes and is called the dynamic part of the condition (if
C is not in this form then we can convert it into disjunctive normal form; in the
resulting condition each disjunct can be separated into two parts: the static
part and the dynamic part). To process such a query, we process each disjunct as
a separate query and take the union. Now we describe a method to process a query
whose condition part is a conjunction of the static and dynamic parts $C_1$ and
$C_2$ respectively (one can envision other possible methods). Using the underlying
database management system we execute the query whose condition part is only
$C_1$. The set of objects thus retrieved is processed against the condition $C_2$ and
appropriate probability values are calculated as follows.

We assume that the dynamic part of the query condition is formed by using
atomic predicates inside(o, R), within_distance(o, R, d) and boolean connectives
$\wedge$ ("and") and $\neg$ ("not") (note that $\vee$, i.e. the "or" operator, can be defined using
$\wedge$ and $\neg$). In the atomic predicate inside(o, R), o is an object variable and R is
the name of a region. An object $o_1$ satisfies this predicate at time t if its location
lies within the region R. The predicate within_distance(o, R, d) is satisfied by
an object $o_1$ traveling on route $r_1$ if the location of $o_1$ is within distance d of the
region R (here the distance is measured along the route, i.e. the route distance).
The following is an example query.

RETRIEVE o FROM Moving-objects WHERE o.type = 'ambulance' ∧ inside(o, R)

Consider an object $o_1$ traveling on route $r_1$. We assume that the route $r_1$
intersects the region R at different places, and certain segments of the route
are in the region R; each such segment is given by an interval [u, v] (where
u < v). For the route $r_1$, we let $Inside\_Int(r_1, R)$ denote the set of all such
intervals. Clearly, object $o_1$ is in region R at time t, if its location at t lies within
any of the intervals belonging to $Inside\_Int(r_1, R)$. Using the set of intervals
$Inside\_Int(r_1, R)$, we can easily compute another set of intervals on route $r_1$,
denoted by $Within\_Int(r_1, R, d)$, such that every point belonging to any of these
intervals is within distance d of region R.

Now consider a condition q formed using the above atomic predicates and
using the boolean connectives. We assume that q has only one free object variable
o. Now we describe a procedure for evaluation of this condition against a set of
objects. The satisfaction of this condition by an object $o_1$ traveling on route $r_1$
at time t only depends on the location of the object at time t. We first compute
the set of all such points. We say that a point x on the route $r_1$ satisfies the
query q, if an object $o_1$ at location x satisfies q. By a simple induction on the
length of q, it is easily seen that the set of points on route $r_1$ that satisfy q
is given by a collection of disjoint intervals (if q is an atomic predicate then
this is trivially the case as indicated earlier; if q is a conjunction of $q_1$ and $q_2$ the
resulting set of intervals for q is obtained by taking pairwise intersection of an
interval belonging to that of $q_1$ and another belonging to that of $q_2$, etc.). We let
$Int(r_1, q)$ denote this set of intervals. A simple algorithm for computing this set
is given below. The probability that $o_1$ satisfies q at time t equals the probability
that the current location of $o_1$ lies within any of the intervals in $Int(r_1, q)$. Let
$\{I_1, I_2, \ldots, I_k\}$ be all the intervals in $Int(r_1, q)$. Since all the intervals in $Int(r_1, q)$
are disjoint, it is the case that for any two distinct intervals $I_i$ and $I_j$ the events
indicating that $o_1$ is inside the interval $I_i$ (resp., inside $I_j$) are mutually exclusive.
Hence, the probability that $o_1$ satisfies q is equal to the sum, over all intervals I
in $Int(r_1, q)$, of the probability that $o_1$ is in the interval I.

Theorem 3: For a query q and route $r_1$, let $\{I_1, I_2, \ldots, I_k\}$ be all the
intervals in $Int(r_1, q)$ where $I_i = [u_i, v_i]$. Then, the probability that object $o_1$
traveling on route $r_1$ satisfies q at time t is given by $\sum_{i=1}^{k} \int_{u_i}^{v_i} f_{o_1}(x)\,dx$. □
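The sum in theorem 3 is easy to evaluate when the density is the bounded normal density discussed earlier, since each integral is a difference of normal distribution function values divided by q. The following Python sketch makes that assumption explicit; the function names and the parameter c (the ratio between the uncertainty and the standard deviation) are illustrative.

```python
import math


def normal_cdf(x, mean, sigma):
    """Distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def satisfaction_probability(intervals, mean, uncertainty, c):
    """Sum over disjoint intervals [u_i, v_i] of the integral of the bounded normal density.

    `mean` is the database location on the route; the density is zero outside
    [mean - uncertainty, mean + uncertainty], so each interval is clipped to it.
    """
    sigma = uncertainty / c
    lo, hi = mean - uncertainty, mean + uncertainty
    q = normal_cdf(hi, mean, sigma) - normal_cdf(lo, mean, sigma)
    total = 0.0
    for u, v in intervals:
        a, b = max(u, lo), min(v, hi)
        if a < b:
            total += (normal_cdf(b, mean, sigma) - normal_cdf(a, mean, sigma)) / q
    return total
```

An interval list covering the whole uncertainty range yields probability 1, and intervals that lie entirely outside the range contribute nothing.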

For the route $r_1$, the set of intervals $Int(r_1, q)$ is computed inductively on
the structure of q as follows.

q is an atomic predicate: If q is inside(o, R), $Int(r_1, q)$ is the same as
$Inside\_Int(r_1, R)$ and this is obtained directly from the database, possibly
using a spatial indexing scheme. If q is within_distance(o, R, d) then
$Int(r_1, q)$ is the same as $Within\_Int(r_1, R, d)$, and this can be computed
directly from $Inside\_Int(r_1, R)$. The list of intervals $Int(r_1, q)$ is output in
sorted order.

$q = q_1 \wedge q_2$: First we compute the lists $Int(r_1, q_1)$ and $Int(r_1, q_2)$. After this, we
take an interval $I_1$ from the first list and an interval $I_2$ from the second list,
and output the interval $I_1 \cap I_2$ (if it is non-empty); the set of all such intervals
will be the output. Since the original two lists are sorted, the above procedure
can be implemented by a modified merge algorithm. The complexity of this
procedure is proportional to the sum of the lengths of the two input lists.

$q = \neg q_1$: First we compute $Int(r_1, q_1)$. We assume that the length of the route
$r_1$ is $l_1$; thus the set of all points on $r_1$ is given by the single interval $[0, l_1]$.
The set of all points on $r_1$ that satisfy q is the complement of the set of points
that satisfy $q_1$ where this complement is taken with respect to all the points
on the route; clearly, this set of points is a collection of disjoint intervals. Now,
it is fairly straightforward to see how the sorted list of intervals in $Int(r_1, q)$
can be computed from $Int(r_1, q_1)$; the complexity of such a procedure is
simply linear in the number of intervals in $Int(r_1, q_1)$.

If $L_1, L_2, \ldots, L_k$ are the lists of intervals corresponding to the atomic
predicates appearing in q and l is the sum of the lengths of these lists, and m is the
length of q then it can be shown that the complexity of the above procedure is
O(lm).
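The inductive computation of the interval lists can be sketched in Python as follows. The representation is an assumption made for illustration: a condition is a nested tuple `("atom", name)`, `("and", q1, q2)` or `("not", q1)`, the sorted interval lists of the atomic predicates are supplied in a dictionary, and all names are hypothetical. Intervals that degenerate to a single point are dropped, since they carry zero probability under a continuous density.

```python
def intersect(xs, ys):
    """Merge-style intersection of two sorted lists of disjoint intervals."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance the list whose current interval ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def complement(xs, route_length):
    """Complement of a sorted list of disjoint intervals within [0, route_length]."""
    out, start = [], 0.0
    for u, v in xs:
        if start < u:
            out.append((start, u))
        start = max(start, v)
    if start < route_length:
        out.append((start, route_length))
    return out


def evaluate(q, atoms, route_length):
    """Sorted list of disjoint intervals of route points that satisfy condition q."""
    op = q[0]
    if op == "atom":
        return atoms[q[1]]
    if op == "and":
        return intersect(evaluate(q[1], atoms, route_length),
                         evaluate(q[2], atoms, route_length))
    if op == "not":
        return complement(evaluate(q[1], atoms, route_length), route_length)
    raise ValueError("unknown connective: %r" % (op,))
```

Both `intersect` and `complement` make a single pass over their inputs, which is in line with the linear cost per connective stated in the text.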

Now consider the query 

RETRIEVE o FROM Moving-objects WHERE $C_1 \wedge C_2$

where $C_1$, $C_2$, respectively, are the static and the dynamic parts of the condition.
The overall algorithm for processing the query is as follows. 

1. Using the underlying database process the following query.

RETRIEVE o FROM Moving-objects WHERE $C_1$.

Let O be the set of objects retrieved.

2. Using the underlying database retrieve the set of routes R on which the
objects in O are traveling.

3. For each atomic predicate p appearing in $C_2$ and for each route $r_1$ in R,
retrieve the list of intervals $Int(r_1, p)$. This is achieved by using any spatial
indexing scheme.

4. Using the algorithm presented earlier, for each route $r_1$, compute the list of
intervals $Int(r_1, q)$.

5. For each route $r_1$ and for each object $o_1$ traveling on $r_1$, compute the
probability that it satisfies q using the formula given in theorem 3.

7 Relevant Work 

One research area to which this paper is related is uncertainty and incomplete
information in databases (see for example [...] for surveys). However, as far as
we know this area has so far addressed complementary issues to the ones in 
this paper. Our current work on location update policies addresses the question: 
what uncertainty to initially associate with the location of each moving object. 
In contrast, existing works are concerned with management and reasoning with 
uncertainty, after such uncertainty is introduced in the database. Our proba- 
bilistic query processing approach is also concerned with this problem. However, 
our uncertainty processing problem is combined with a temporal-spatial aspect 
that has not been studied previously as far as we know. 

Our problem is also related to mobile computing, particularly works on loca- 
tion management in the cellular architecture. These works address the following 
problem. When calling or sending a message to a mobile user, the network infras- 
tructure must locate the cell in which the user is currently located. The network 
uses the location database that gives the current cell of each mobile user. The 
record is updated when the user moves from one cell to another, and it is read 
when the user is called. Existing works on location management (see, for
example, [...]) address the problem of allocating and distributing the location
database such that the lookup time and update overhead are minimized.
Location management in the cellular architecture can be viewed as addressing the
problem of providing uncertainty bounds for each mobile user. The geographic 
bounds of the cell constitute the uncertainty bounds for the user. Uncertainty 
at the cell-granularity is sufficient for the purpose of calling a mobile user or 
sending him/her a message. When it is also sufficient for MOD applications, the 
location database can be sold by wireless communication vendors to mobile fleet 
operators. However, often uncertainty at the cell granularity is insufficient. For 
example, in satellite networks the diameter of a cell ranges from hundreds to 
thousands of miles. 

Another relevant research area is constraint databases (see [...] for a survey). In
this sense, our location attributes can be viewed as a constraint, or a generalized
tuple, such that the tuples satisfying the constraint are considered to be in the
database. Constraint databases have been separately applied to the temporal
domain (see [...]) and to the spatial domain (see [...]). Constraint databases can
be used as a framework in which to implement the proposed update policies and
query processing algorithm.

Finally, the present paper extends the work on which we initially reported
in [...] in two important ways. First, in this paper we introduce a new
quantitative probabilistic model and method of processing range queries. In contrast,
in previous works we took a qualitative approach in the form of "may" and
"must" semantics of queries. Second, in this paper we introduce uncertainty as a
separate concept from deviation. The previous work on update policies (i.e. [...])
is not equipped to distinguish between uncertainty and deviation. Consequently,
the location update policies discussed in this paper are different in two respects
from the update policies in [...]. First, they take uncertainty into consideration
when determining when to send a location update message. Second, they are dead
reckoning policies; namely they provide the uncertainty, i.e. the bound on the
deviation, with each location update message. In contrast, the [...] policies are not
dead reckoning in the sense that the moving object does not update its location
when the deviation reaches some threshold; the update time-point depends on
the overall behavior of the deviation since the last update. Our simulation results
indicate that the [...] policies are inferior to adr (and often to dtdr as well) when
the uncertainty cost is taken into consideration, and this inferiority increases as
the cost per unit of uncertainty increases.



8 Conclusion 

In this paper we considered dead-reckoning policies for updating the database
location of moving objects, and the processing of range queries in motion
databases. When using a dead-reckoning policy, a moving object equipped with
a Geographic Positioning System periodically sends an update of its database
location and provides an uncertainty threshold th. The threshold indicates that
the object will send another update when the deviation, namely the distance
between the database location and the actual location, exceeds th.
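
The update rule just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the paper: the database location is treated as fixed between updates, the threshold th is constant, and the names `deviation` and `simulate_updates` are hypothetical.

```python
import math


def deviation(db_location, actual_location):
    """Euclidean distance between the database location and the actual location."""
    return math.dist(db_location, actual_location)


def simulate_updates(actual_path, th):
    """Return the time steps at which the object sends a location update.

    Simplifying assumptions (not from the paper): the database location stays
    fixed between updates and th never changes.
    """
    db_location = actual_path[0]
    update_steps = [0]  # the initial location is reported at step 0
    for step, actual in enumerate(actual_path[1:], start=1):
        if deviation(db_location, actual) > th:
            db_location = actual  # an update resets the deviation to zero
            update_steps.append(step)
    return update_steps


path = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
print(simulate_updates(path, th=2))
```

With a larger th the object sends fewer messages but the database answer carries more uncertainty, which is the tradeoff the cost model is meant to capture.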

Dead-reckoning policies imply that the DBMS answers a query about the lo-
cation of an object m by: “the current location of m is (x,y) with a deviation of
at most th”. When making decisions based on such an answer, there is a cost in
terms of the deviation of m from (x,y), and in terms of the uncertainty about its 
location. These costs should be balanced against the cost (in terms of wireless 
bandwidth, and update processing) of sending location update messages. We 
introduced a cost model that captures the tradeoffs between communication, 
uncertainty and deviation by assigning costs to an uncertainty unit, a deviation 
unit, and a communication unit. We explained that these costs should be deter- 
mined by answering questions such as: how many messages is the system willing 
to utilize in order to reduce the deviation by one unit during a unit of time? Is 
a unit of uncertainty more important than a unit of deviation, or vice versa? 

Then we introduced two dead-reckoning policies, adaptive dead-reckoning 
(adr), and disconnection detection dead-reckoning (dtdr). Both adjust the un- 
certainty threshold at each update to the current motion (or speed) pattern. 
This pattern is captured by the concept of the predicted deviation. The differ- 
ence between the two policies is that dtdr uses a novel technique for disconnection 
detection in mobile computing, namely decreasing uncertainty threshold. Intu- 
itively, the technique postulates that the probability of communication should 
increase as the period of time since the last communication increases. Thus, the 
probability of the object being disconnected increases as the period of time since 
the last update increases. Dtdr demonstrates the use of this technique. 

Then we reported on the development of a simulation testbed for evaluation 
of location update policies. We used it in order to compare the information cost
of adr, dtdr, and speed dead-reckoning (sdr) in which the uncertainty threshold 
is arbitrary and fixed. The result of the comparison is that adr is superior to the 
other policies in the sense that it has a lower information cost. Actually, it may 
have an information cost which is six times lower than that of sdr. We quantified 
the disconnection detection cost as the difference between the cost of dtdr and 
that of adr. We also determined that when taking uncertainty into consideration, 
the information costs of adr and dtdr are lower than that of non-dead-reckoning 
policies which we developed previously. 

Finally, an additional contribution of this paper is a probabilistic model and 
an algorithm for query processing in motion databases. In our model the location 
of the moving object is a random variable, and at any point in time the database 
location and the uncertainty are used to determine a density function for this 
variable. Then we developed an algorithm that processes range queries such as 
‘retrieve the moving objects that are currently inside a given polygon P’. The 
answer is a set of objects, each of which is associated with the probability that 
currently the object is inside P. 

Now consider the following variant of the location update problem. In some 
cases MOD applications may not be interested in the location of moving ob- 
jects at any point in time, but in their arrival time at the destination. Assume 
that the database arrival information is given by “The object is estimated to
arrive at destination X at time t, with an uncertainty of U”. In other words, t
is the database estimated-arrival-time (eat) and we assume that at any point
in time before arrival at destination X, the moving object can compute the
actual eat t'. The difference between t and t' is the deviation, and the uncer-
tainty U denotes the bound on the deviation of the eat; the object will send an
eat update message when the deviation reaches U. In this variant, the motion 
database update problem is to determine when a moving object should update 
its database estimated-arrival-time. The results that we developed in this paper 
for the location update problem carry over verbatim to the eat update problem. 
Abstract. Spatial databases are modeled as closed semi-algebraic sub-
sets of the real plane. First-order logic over the reals (expanded with 
a symbol to address the database) provides a natural language for ex- 
pressing properties of such databases. Motivated by applications in ge- 
ographical information systems, this paper investigates the question of 
which topological properties can be thus expressed. We introduce a novel,
two-tiered logic for expressing topological properties, called CL, which is
subsumed by first-order logic over the reals. We put forward the question
whether the two logics are actually equivalent (when restricting attention
to topological properties). We answer this question affirmatively on the
class of “region databases.” We also prove a general result which further
illustrates the power of the logic CL.



1 Introduction and Summary 

A simple yet powerful way of modeling spatial data is using semi-algebraic sets.
A subset A of n-dimensional Euclidean space R^n is called semi-algebraic if it can
be defined by a Boolean system of polynomial inequalities. First-order logic over 
the reals, denoted here by FO[R], then becomes a spatial query language, fitting 
in the (by now rather well known) framework of constraint query languages 
introduced by Kanellakis, Kuper and Revesz M- The goal of this paper is to 
understand the power of this formalism in expressing topological queries.^1

We will work with planar spatial databases, whose contents are described
by semi-algebraic sets S in the plane R^2. An example of a first-order query in
this context is “is the database bounded?”, which can be expressed in FO[R] as
$(\exists b > 0)(\forall x)(\forall y)(S(x,y) \rightarrow (-b < x < b \wedge -b < y < b))$.^2 We will consider only sets

* Post-doctoral research fellow of the Fund for Scientific Research of Flanders (FWO- 
Vlaanderen) . 

^1 The work we will present is similar in spirit to work done in topological model theory
mm, though the technical focus is quite different. 

^2 The subformula $-b < x$ is, of course, a shorthand for $(\exists z)(z + b = 0 \wedge z < x)$. Note
that formally, we work in an expansion of first-order logic over the reals with a binary
that are closed in the ordinary topology on R^2. This assumption is of great help
from a technical point of view, and is harmless from a practical point of view. 

Topological properties. A property of spatial databases is called topological if
it is invariant under topological transformations of the plane. More precisely,
whenever the property holds for some A, it must also hold for any other A'
that is the image of A under a homeomorphism of the plane.^3 For example,
the above-mentioned property “the database is bounded” is topological, as is 
the property “the database consists of (curved) lines only”. In contrast, the 
property “the database contains a straight line” is not. Apart from our interest 
in topological properties as a natural and mathematically well-motivated class 
of properties, they are also practically motivated by geographical information 
systems |b l V I H I 14 l lHj . 

So far there has not been much understanding of the class of topological prop-
erties that are first-order (i.e., expressible in FO[R]), except for the feeling that
this class must be rather meager. Indeed, many topological properties are not
first-order; for example, one cannot express in FO[R] that the database is topo-
logically connected.^4 But exactly which topological properties are first-order?

Cone Logic. What we do understand quite well is when two given sets A and 
A' are topologically elementary equivalent. This means that any FO[R]-sentence 
that is topological will not distinguish between A and A' . Indeed, Paredaens and 
the present authors HS| discovered a characterization of topological elementary 
equivalence in terms of the cone types occurring in the two given databases. 
Semi-algebraic sets are topologically well-behaved in that locally around each
point they are “conical”. The cone of a point can either be completely filled
(in case of points in the interior of the set), completely empty (in case of iso- 
lated points or points not in the set), or consisting of lines and regions arriving 
in the point. A database can be partitioned according to the cone types of its 
points. The characterization states that two databases are topologically elemen- 
tary equivalent if and only if the cardinalities of the equivalence classes of their 
partitions match. 

In this paper we introduce Cone Logic (CL), in which only topological prop-
erties can be expressed. The logic CL is two-tiered: at the bottom tier, there is a
first-order logic for expressing properties of cones, which can talk about the lines
and regions making up the cone, and their relative order in the cone. At the top
tier, any sentence $\gamma$ from the bottom tier can be used in an “atomic” formula of
the form $[\gamma](p)$, where p is a point variable; this formula expresses that the cone
of p satisfies property $\gamma$. The only other atomic predicate at the top tier is the

relation symbol S to address the content of the database. However, we will use the 
same notation FO[R] to denote this first-order query language. 

^3 A homeomorphism of the plane is a bijection $f : \mathbf{R}^2 \rightarrow \mathbf{R}^2$ such that both $f$ and
$f^{-1}$ are continuous.

^4 This follows from the combined results of Benedikt, Dong, Libkin and Wong [2] and
Grumbach and Su m- 
symbol S to address the database; the top tier is then closed under the standard
first-order operations. An example of a sentence in CL is

$\forall p\,[\exists x\,R(x) \rightarrow \exists!\,x\,L(x)](p)$,

which expresses that at every point bordering a region (R), there can be at most
one line (L) entering that region. Another example is 

$\exists p\,[\exists x\,\exists y\,\exists z\,\exists u\,(R(x) \wedge L(y) \wedge R(z) \wedge L(u) \wedge B(x,y,z) \wedge B(z,u,x))](p)$,

which expresses that there is a point where two regions meet, and through which
a line runs between the two regions. (The predicate $B(x, y, z)$ denotes that cone
element y lies between cone elements x and z.)

Note that while FO[R] talks about points in terms of their coordinates, 
CL can only talk about points directly and does not even have access to their
coordinates. Every property expressible in CL is also expressible in FO[R]. We
investigate the question of the converse: is CL first-order complete? That is, is
every first-order topological property expressible in CL?

Circular languages. As a first illustration of the power of CL, we show that any
property of cones expressible in FO[R] can also be expressed in CL. Since a
non-trivial cone can be represented as a circular list of L’s and R’s, an arbitrary
property of cones can be represented as a set of such circular lists; we call such
a set a circular language. We prove for any circular language T that if “the cone
of point (x,y) satisfies T” is expressible in FO[R], then “the cone of point p
satisfies T” is expressible in CL.

Region databases. A database is called a region database if, intuitively, it only 
contains “filled” figures. More precisely, the cone of every point in the database 
must either be completely full or consist exclusively of R’s (regions). Region
databases appear often in geographical information systems. 

With each region database we can associate an abstract directed graph of a 
very simple form. For each singular point p in the database there is a “parent” 
node in the graph with outgoing edges to n “child” nodes, where n is the num- 
ber of R’s in the cone of p. The sets of child nodes for different parent nodes 
are disjoint. Importantly, by the above-mentioned characterization of topologi- 
cal elementary equivalence, any two topologically elementary equivalent region 
databases have the same associated abstract graph. 

Our second main result is then that a topological property of region databases 
is expressible in FO[R] if and only if it is expressible in standard first-order logic 
when looking at the abstract graph of a database instead of at the database 
itself.^5 Using a quantifier elimination procedure, we obtain as a corollary the
first-order completeness of CL on the class of region databases.

The general question of first-order completeness of CL on the class of all planar
spatial databases remains open. Other open questions are to extend our results 

^5 With “standard” first-order logic of graphs we mean first-order logic over one binary
relation E, used to address the edges of the graph. 
to databases consisting of multiple semi-algebraic sets (rather than just one), or
to non-planar (e.g., 3D) databases. 



Lifting collapse theorems. In the proofs of our completeness results we make 
heavy use of a powerful tool: a “collapse theorem” by Benedikt, Dong, Libkin 
and Wong [2]. This theorem says that any FO[R]-definable property of finite
databases that is invariant under monotone bijections from R to R, is already 
expressible by a sentence that uses no arithmetic, except for the order predicate. 
So, this sentence mentions only the predicate < and the relation symbol S for 
the database content. 

Now CL is subsumed by first-order logic over (<,S). Hence, our first-order 
completeness result for CL “lifts” collapse to the level of infinite, semi-algebraic, 
sets, which are much more relevant in the spatial context than finite databases. 



Acknowledgment. We thank Jan Paredaens for a number of inspiring discussions 
we had with him in the initial stage of this work. 



2 Preliminaries 

Spatial databases. We denote the real numbers by R, so R^2 denotes the real
plane. A semi-algebraic set in R^2 is a set of points that can be defined as

$\{(x,y) \in \mathbf{R}^2 \mid \varphi(x,y)\}$,

where $\varphi(x,y)$ is a formula built using the Boolean connectives $\wedge$, $\vee$, and $\neg$ from
atoms of the form $P(x,y) > 0$, where $P(x,y)$ is a polynomial in the variables
x and y with integer coefficients. Observe that $P = 0$ is equivalent to
$\neg(P > 0) \wedge \neg(-P > 0)$, so equations can be used as well as inequalities.
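
As a small illustration of the observation above, the following Python sketch builds membership tests from atoms of the form P > 0 only. The helper names `positive` and `zero` and the unit-circle polynomial are hypothetical choices made for this example; they do not come from the paper.

```python
def positive(p):
    """Membership test for the atom P(x, y) > 0."""
    return lambda x, y: p(x, y) > 0


def zero(p):
    """P(x, y) = 0 expressed as  not (P > 0)  and  not (-P > 0)."""
    def negated(x, y):
        return -p(x, y)
    return lambda x, y: (not positive(p)(x, y)) and (not positive(negated)(x, y))


def circle(x, y):
    """Example polynomial with integer coefficients: x^2 + y^2 - 1."""
    return x * x + y * y - 1


on_circle = zero(circle)                       # the unit circle


def closed_disk(x, y):
    """The closed unit disk, defined as  not (P > 0)."""
    return not positive(circle)(x, y)


print(on_circle(1, 0), on_circle(0, 0), closed_disk(0, 0), closed_disk(2, 0))
```

Both example sets are closed, so they would qualify as databases in the sense defined next.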

In this paper, a database is defined as a semi-algebraic set in R^2 that is
closed in the ordinary topological sense. It is known that these are precisely
the finite unions of sets of points that can be defined as

$\{(x,y) \in \mathbf{R}^2 \mid P_1(x,y) \geq 0 \wedge \ldots \wedge P_m(x,y) \geq 0\}$.

In other words, we disallow the essential use of strict inequalities in the definition
of a database.

First-order logic over the vocabulary $(0, 1, +, \times, <, S)$, with S a binary rela-
tion symbol, is denoted by FO[R]. An FO[R]-formula $\varphi$ can be evaluated on a
database A by letting variables range over R, interpreting the arithmetic sym-
bols in the obvious way, and interpreting $S(x,y)$ to mean that the point (x,y)
is in A.

To formalize what it means for two databases A and B to be topologically 
the same, we use the notion of isotopy. An isotopy is a continuous deformation 



On Capturing First-Order Topological Properties 191 



of the plane.^6 A and B are called isotopic if there is an isotopy h such that
h(A) = B.^7

An FO[R]-sentence $\varphi$ is called topological if whenever databases A and B
are isotopic, then $A \models \varphi$ if and only if $B \models \varphi$. Finally, two databases A and B
are called topologically elementary equivalent if for each topological sentence $\varphi$,
$A \models \varphi$ if and only if $B \models \varphi$.

Cones. A known topological property of semi-algebraic sets is that locally
around each point they are conical. This is illustrated in Figure 1. For every
point p of a semi-algebraic set A there exists an $\varepsilon > 0$ such that $D(p, \varepsilon) \cap A$ is
isotopic to the planar cone with top p and base $C(p, \varepsilon) \cap A$.^8 We thus refer to
the cone of p in A.




Fig. 1. A database and the cone of one of its points. 



A database is also conical around the point at infinity.^9 More precisely, there
exists an $\varepsilon > 0$ such that $\{(x,y) \mid x^2 + y^2 > \varepsilon^2\} \cap A$ is isotopic to
$\{\lambda \cdot (x,y) \mid (x,y) \in C((0,0), \varepsilon) \cap A \wedge \lambda > 1\}$. We can indeed view the latter set as the cone
with top $\infty$ and base $C((0,0), \varepsilon) \cap A$, and call it the cone of $\infty$ in A.

We use the following finite representation for cones. The cone having a full 
circle as its base (which appears around interior points) is represented by the letter
F. Any other cone can be represented by a circular list of L’s and R’s (for “line” 

^6 Formally, an isotopy is a homeomorphism of the plane that is isotopic to the identity.
Two homeomorphisms $f$ and $g$ are isotopic if there is a continuous function $F : \mathbf{R}^2 \times [0, 1] \rightarrow \mathbf{R}^2$
such that for each $t \in [0, 1]$, the function $F_t : \mathbf{R}^2 \rightarrow \mathbf{R}^2 : p \mapsto F(p, t)$
is a homeomorphism and $F_0$ is $f$ and $F_1$ is $g$.

^7 A more relaxed notion of “being topologically the same” is to simply require that B
is the image of A under a homeomorphism rather than an isotopy. The only difference 
between the two notions is that the latter considers mirror images to be the same, 
while the former does not. Indeed, every homeomorphism either is an isotopy itself, 
or is isotopic to a reflection 0- All the results we will present under isotopies have 
close analogues under homeomorphisms. 

^8 $D(p, \varepsilon)$ is the closed disk with center p and radius $\varepsilon$; $C(p, \varepsilon)$ is its bordering circle.

^9 If we project R^2 stereographically onto a sphere, the point at infinity corresponds
to the missing point on the sphere. 
and “region”) which describes the cone in a complete clockwise turn around the
top. For example, the cone of Figure 1 is represented by (LLRLR). The cone
with empty base (which appears around isolated points) is represented by the 
empty list ( ). The set of all cones, represented in the way just explained, will be 
denoted by C. 

Let A be a database. The point structure of A is the function $\Pi(A)$ from
$A \cup \{\infty\}$ to C that maps each point to its cone in A. It can be shown that
$\Pi(A)^{-1}$ is empty on all but a finite number of cones. Moreover, there are only
three cones where $\Pi(A)^{-1}$ can be infinite: F, (LL) (the cone around points on
curves), and (R) (the cone around points on the smooth border of a region). It
can indeed be shown that in each database, the points with a cone different from
these three are finite in number. These points are called the singular points of the
database.

Let A and B be databases. We say that $\Pi(A)$ is isomorphic to $\Pi(B)$, denoted
by $\Pi(A) \cong \Pi(B)$, if there is a bijection $f$ from $A \cup \{\infty\}$ to $B \cup \{\infty\}$ with
$f(\infty) = \infty$, such that $\Pi(A) = \Pi(B) \circ f$. Paredaens and the present authors
gave the following characterization m 

Theorem 1. Two databases A and B are topologically elementary equivalent if
and only if $\Pi(A) \cong \Pi(B)$.
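
To make the isomorphism condition concrete, here is a hedged Python sketch that compares two point structures through a finite summary. The encoding is an assumption made for this example, not the paper's: cones are strings over {L, R} read clockwise, "F" is the full cone, "" is the empty cone, and every infinite class of points is treated as having the same cardinality. The names `canonical`, `summary` and `same_point_structure` are hypothetical.

```python
from collections import Counter


def canonical(cone):
    """Canonical rotation of a circular list; "F" and "" are left unchanged.

    Only rotations are identified, not reflections, because cones are read
    clockwise.
    """
    if cone in ("F", ""):
        return cone
    return min(cone[i:] + cone[:i] for i in range(len(cone)))


def summary(cone_at_infinity, singular_cones, infinite_cones):
    """Finite summary of a point structure.

    cone_at_infinity: cone of the point at infinity
    singular_cones:   one cone per singular point (finitely many)
    infinite_cones:   which of "F", "LL", "R" actually occur in the database
    """
    return (
        canonical(cone_at_infinity),
        Counter(canonical(c) for c in singular_cones),
        frozenset(canonical(c) for c in infinite_cones),
    )


def same_point_structure(a, b):
    """True when a cone-preserving bijection fixing infinity can exist."""
    return a == b


a = summary("", ["RLLRL", "LRR"], ["F", "R", "LL"])
b = summary("", ["RRL", "LLRLR"], ["LL", "F", "R"])
print(same_point_structure(a, b))
```

Under these assumptions, equal summaries correspond to the condition of Theorem 1: the cone at infinity matches and every cone type has the same number of points in both databases.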



3 Cone Logic 

In this section we introduce the logic CL (cone logic). This is a two-tiered logic.
At the bottom tier we have a first-order logic for expressing properties of cones. 
At the top tier we can use sentences from the bottom tier to talk about points 
in the database and their cones. 

Logical properties of cones. Consider the vocabulary C consisting of the propo- 
sitional symbols F and E, the unary relation symbols L and R, and the ternary 
relation symbol B. First-order logic sentences over C will be called C-sentences. 

An arbitrary cone can be viewed as a finite C-structure as follows. The full
cone F is viewed as the empty structure where proposition F is true (and propo-
sition E is false); the empty cone ( ) is viewed as the empty structure where E
is true (and F false). A cone of the form $(c_0 \ldots c_{n-1})$, where each $c_i$ is L or R,
is viewed as the structure with domain $\{0, \ldots, n-1\}$ in which propositions F
and E are false; relation L equals $\{i \mid c_i = L\}$; relation R equals $\{i \mid c_i = R\}$;
and relation B equals $\{(i,j,k) \mid 0 < (j - i) \bmod n < (k - i) \bmod n\}$. Relation
B stands for “betweenness”: B(i,j,k) holds if when we walk around the cone 
in clockwise order starting from element nr. i, we meet element nr. j before we 
meet element nr. k. 

Under the above view we can evaluate C-sentences on cones. For example, 
the cone (RLLRL) satisfies the C-sentence 



$\exists x\,\exists y\,\exists z\,\exists u\,(R(x) \wedge L(y) \wedge R(z) \wedge L(u) \wedge B(x,y,z) \wedge B(z,u,x))$.
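
The evaluation of this example sentence can be checked mechanically. The sketch below is a brute-force illustration only; it encodes a cone as a Python string over {L, R}, and the function names `between` and `satisfies_example` are hypothetical.

```python
from itertools import product


def between(n, i, j, k):
    """B(i, j, k): walking clockwise from i, element j is met before element k."""
    return 0 < (j - i) % n < (k - i) % n


def satisfies_example(cone):
    """Evaluate the example sentence on a cone given as a string over {L, R}.

    The four existential quantifiers are expanded into a search over all
    4-tuples of positions of the circular list.
    """
    n = len(cone)
    return any(
        cone[x] == "R" and cone[y] == "L" and cone[z] == "R" and cone[u] == "L"
        and between(n, x, y, z) and between(n, z, u, x)
        for x, y, z, u in product(range(n), repeat=4)
    )


print(satisfies_example("RLLRL"))
```

For (RLLRL) one witness is x = 0, y = 1, z = 3, u = 4. A cone such as (RRL) fails the sentence, since no L lies between its two R's in both directions.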
The logic CL. Cone logic is first-order logic over the infinite vocabulary consisting
of the constant symbol $\infty$, the unary relation symbol S, and all unary relation
symbols of the form $[\gamma]$, with $\gamma$ a C-sentence.

A CL-formula can be evaluated on a database A in the following way: $\infty$ is
interpreted by the point at infinity; $S(p)$ means that p is a point belonging to
A; and $[\gamma](p)$ means that the cone of p in A satisfies $\gamma$. Variables and quantifiers
range over the points in the plane.

Since the cone structure of a database is left invariant by isotopies, we have: 



Proposition 1. Every property expressed by a CL-sentence is topological.

We also note (proof delayed to the next section):

Proposition 2. For every CL-formula there is an equivalent FO[R]-formula.

The natural question now arises: is every topological property expressible in
FO[R] also expressible in CL? We investigate this problem, which we call the
first-order completeness of CL, in the following sections.

4 Circular Languages 

Let us call a circular language any cone property (i.e., a set of cones) that does
not contain the two special cases of the full cone and the empty cone (these can
be treated separately). So a circular language is a set of non-empty circular lists
of L’s and R’s.

A circular language T is called FO[R]-definable if there is an FO[R]-formula
$\varphi(x, y)$ such that for each database A and each point $(x_0, y_0) \in A$, $A \models \varphi[x_0, y_0]$
iff the cone of $(x_0, y_0)$ in A belongs to T. We are going to show:

Theorem 2. Every FO[R]-definable circular language T is definable by a C-
sentence.

Before we sketch the proof, we remark that it is easy to characterize the C- 
definable circular languages. Let T be an arbitrary set of words over the alphabet 
{L, R}. We can turn T into a circular language by circularizing every word
in T . It is well known m how words over the alphabet {L,R} can be viewed 
as finite structures over the vocabulary consisting of the unary relation symbols 
L and R, and the order predicate <. The first-order definable sets of words are 
then precisely the star-free regular languages. Using this fact, the following is 
not difficult to see: 

Proposition 3. The circular languages definable by C-sentences are precisely
those obtained by circularizing a star-free regular language T over {L, R}.

Using this fact, we can prove Proposition 2 using induction on the star-free
regular expressions. We omit the details.

We now present: 
Fig. 2. Construction of $A(\Delta)$.



Proof of Theorem 2. (Sketch) Let $\varphi$ be the FO[R]-formula defining T. Let
$T^{\mathrm{word}}$ be the set of all words over {L, R} that belong to T when viewed as
circular lists.

Consider finite (L, R)-structures $\Delta$ over the reals, with L and R unary re-
lation symbols. We first show that “$\Delta \in T^{\mathrm{word}}$” (meaning that the elements
of $L^\Delta$ and $R^\Delta$, when scanned from right to left, spell out a word in $T^{\mathrm{word}}$) is
expressible by a first-order sentence over the vocabulary (L, R, <), where < is
the real ordering. Indeed, we can find an FO[R]-formula (over the vocabulary
$(0, 1, +, \times, <, L, R)$) that defines, on any $\Delta$, a database $A(\Delta)$ that is conical with
top (0, 1) and that contains through each element of $L^\Delta$ or $R^\Delta$ (embedded on
the x-axis of R^2) a line or triangular strip. This is illustrated in Figure 2 for
$L^\Delta = \{0, 2, 3\}$ and $R^\Delta = \{1\}$.

Then “$\Delta \in T^{\mathrm{word}}$” is equivalent to $A(\Delta) \models \varphi[0, 1]$. The latter sentence is also
invariant under monotone bijections from R to R. Hence, by a collapse theorem
by Benedikt, Dong, Libkin and Wong [2], the sentence is equivalent to a sentence
$\psi$ over (L, R, <). Moreover, we may assume without loss of generality that the
quantifiers in $\psi$ range only over the active domain of the given structure nHEi-
Consider now the C-sentence $\gamma = \exists f\,\psi'$, where $\psi'$ is obtained from $\psi$ by
replacing all occurrences of $x < y$ by $B(f, x, y)$. Then $\gamma$ defines the circular
language T. □
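
The last step of the proof sketch relies on the fact that fixing an element f turns betweenness into an order. The following Python sketch checks this property of the relation B by brute force on small cones; it illustrates only that one step, and the function names are hypothetical.

```python
def between(n, i, j, k):
    """B(i, j, k) on the domain {0, ..., n-1}, as defined in Section 3."""
    return 0 < (j - i) % n < (k - i) % n


def linear_order_from(n, f):
    """Elements other than f, listed in the order induced by  x < y  iff  B(f, x, y)."""
    others = [x for x in range(n) if x != f]
    return sorted(others, key=lambda x: (x - f) % n)


def is_strict_linear_order(n, f):
    """Check that  x < y  iff  B(f, x, y)  is a strict linear order away from f."""
    others = [x for x in range(n) if x != f]

    def less(x, y):
        return between(n, f, x, y)

    irreflexive = not any(less(x, x) for x in others)
    total = all(less(x, y) != less(y, x)
                for x in others for y in others if x != y)
    transitive = all(less(x, z)
                     for x in others for y in others for z in others
                     if less(x, y) and less(y, z))
    return irreflexive and total and transitive


print(linear_order_from(5, 3))
```

Note that f itself is not comparable to any element under this relation, since B(f, f, y) and B(f, x, f) never hold by the definition of B.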

5 Region Databases 

In this section we focus on region databases, defined as databases in which the 
cone of every point is either F or consists exclusively of R’s. Intuitively, such
databases contain only “filled” figures. We are going to show:

Theorem 3. CL is first-order complete on the class of region databases.

We will prove Theorem 3 by establishing a connection between topological prop-
erties and logical properties of abstract graphs.
Fig. 3. At the top, a database A; at the bottom, graph(A).



Let A be a (fully two-dimensional) database, and let p be a singular
point in A. The number of R's in the cone of p in A is called the degree of p in
A and denoted by deg p. We can now associate an abstract directed graph to A,
denoted by graph(A), as follows (an illustration is given in Figure 3). The set of
nodes of graph(A) equals the union of the set of singular points in A with the
set $\{(p, i) \mid p \text{ a singular point in } A \text{ and } 1 \leq i \leq \deg p\}$.^10 The nodes in the first
set are called parent nodes; the nodes in the second set are called child nodes.
There is an edge from each node p to each node (p, i).

The graphs that equal graph(A) for some A are called the depth-one forests.
Also note that, by Theorem 1, if graph(A) and graph(B) are isomorphic then A
and B are topologically elementary equivalent.
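
The construction of graph(A) is simple enough to write down directly. The Python sketch below assumes, for illustration only, that the singular points are given by string names together with their degrees; the function names `graph_of` and `degree_profile` are hypothetical.

```python
def graph_of(degrees):
    """Build graph(A) from a dict mapping each singular point to its degree.

    Returns (nodes, edges). Parent nodes are the singular points; the child
    nodes of p are the pairs (p, 1), ..., (p, deg p).
    """
    nodes = set(degrees)
    edges = set()
    for p, deg in degrees.items():
        for i in range(1, deg + 1):
            nodes.add((p, i))
            edges.add((p, (p, i)))
    return nodes, edges


def degree_profile(degrees):
    """Sorted list of degrees; it determines a depth-one forest up to isomorphism."""
    return sorted(degrees.values())


a = {"p": 2, "q": 3}
b = {"u": 3, "v": 2}
nodes, edges = graph_of(a)
print(len(nodes), len(edges), degree_profile(a) == degree_profile(b))
```

Since the child sets of different parents are disjoint, two depth-one forests are isomorphic exactly when their degree profiles coincide, which is why the sketch compares sorted degree lists rather than searching for a graph isomorphism.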

We view directed graphs in the usual manner as structures over the vocabu- 
lary consisting of a single binary relation symbol E; the domain equals the set 
of nodes, and relation E equals the set of edges. We refer to first-order logic over 
this vocabulary {E} as FO of graphs. To prove Theorem 3 we will also need to
talk about ordered graphs. These are graphs with an additional order predicate
<, which is an arbitrary linear order on all nodes. First-order logic over (E, <)
will be referred to as FO of ordered graphs. 

Let G be some class of graphs. An FO sentence p of ordered graphs is called 
order-invariant over G if it does not distinguish between different orderings of 
the same graph in G. It is well known (e.g., 0 Exercise 17.27], Proposi- 
tion 2.5.6]) that in general, order-invariant FO sentences of ordered graphs are 



^10 For simplicity of presentation in this section, we ignore the point at infinity. It can
be accommodated for by adding a few technicalities. 
more powerful than standard FO sentences of graphs. However, for G the class
of depth-one forests, we can prove: 

Proposition 4. Every FO sentence of ordered graphs that is order-invariant 
over depth-one forests is equivalent (on depth-one forests) to a standard FO 
sentence of graphs. 

Proof. (Sketch) We rewrite FO formulas working on depth-one forests in a 
“many-sorted normal form” where variables range either only over parent nodes 
or only over child nodes. An ordering of a depth-one forest is called “canonical” 
if all parent nodes come first in the order, ordered by increasing number of 
children, followed by the child nodes ordered according to their parents. 

We can show that on canonically ordered depth-one forests,^11 every FO for-
mula is equivalent to a quantifier-free one over the expansion of the vocabulary
(E, <) with the following constants, functions and predicates. For each n, we
have constants for the minimal parent having at least n children and the max- 
imal parent having at most n children, as well as constants for the globally 
minimal and maximal parent. For each n, we have the binary predicate among 
parent nodes that the number of nodes between them in the order is at least 
n. We have functions giving the minimal and maximal child of a parent node. 
We have the sibling relation among child nodes. We have functions giving the 
minimal and maximal sibling of a child node. Finally, for each n, we have the 
binary predicate among child nodes that the number of nodes between them in 
the order is at least n. 

The quantifier elimination procedure proceeds inductively as usual, distin- 
guishing between parent variables and child variables in eliminating quantifiers, 
and adding the necessary extra predicates where needed to preserve equivalence. 
A quantifier-free sentence in this expanded logic can only talk about bounds on 
cardinalities of sets of parents defined by bounds on their number of children. 
Such properties are already expressible in standard FO of graphs. □ 

Having Proposition 4 at our disposal, we can now establish the following
upper bound on the topological properties expressible in FO[R]: 

Lemma 1. For every topological FO[R]-sentence $\varphi$ there exists an FO sentence
$\psi$ of graphs such that for each database A, $A \models \varphi$ iff graph(A) $\models \psi$.

Proof. (Sketch) Given a depth-one forest G embedded in the reals, we can con-
struct a database A(G) topologically elementary equivalent to any database A
for which graph(A) and G are isomorphic. This construction is illustrated in
Figure 4. Actually, we can even find an FO[R]-formula $\theta$ (mentioning the re-
lation symbol E) that performs this construction (i.e., that defines A(G) given
any G). Hence, the sentence $\psi$ being the composition of $\theta$ with $\varphi$ is the wanted
sentence, if it were not for the fact that $\psi$ is not in FO of graphs but rather

^11 Considering order-invariant sentences, it is sufficient to restrict attention to depth-
one forests that are canonically ordered.
in the much richer logic FO[R]. However, $\psi$ is order-generic, so by the collapse
theorem already used in the proof of Theorem 2 we may assume it to be over
the vocabulary (E, <) only. Again, we may even assume that quantifiers in $\psi$
range over the active domain only; hence $\psi$ is an FO sentence of ordered graphs.
Finally, since $\psi$ is order-invariant, applying Proposition 4 yields the desired FO
sentence of graphs. □


Fig. 4. The database A(G) for some real embedding G of the graph of Figure 3.
The parent nodes are placed on the real axis at the positions of their correspond- 
ing real numbers. The children are then drawn around the parents as regions, in 
the directions obtained from their corresponding real numbers (after a scaling 
from R to R+). Note that A(G) is topologically elementary equivalent to the
database of Figure 3; this is a crucial property of the construction.



Theorem 3 now follows immediately from Lemma 1 and the following coun-
terpart to it:

Lemma 2. For every FO sentence $\psi$ of graphs there exists a CL-sentence $\varphi$
such that for each database A, graph(A) $\models \psi$ iff $A \models \varphi$.

Proof. (Sketch) From the proof of Proposition El (which dealt with
order-invariant FO sentences of ordered graphs and thus certainly applies to
standard FO sentences of graphs), we may assume $\psi$ to be quantifier-free, as
long as extra predicates are available for talking about bounds on cardinalities
of sets of parents defined by bounds on the number of their children. Translated
from graph(A) to A, this represents bounds on the cardinalities of sets of
singular points defined by bounds on their degree. But such properties are
expressible in CC. □

6 Discussion 

The most obvious direction for further research is to extend our completeness
result for CC from the class of region databases to the general class of all (closed)
databases. The point where our proof fails for the general case is the construction
illustrated in Figure 4, where we construct, in FO[R], from any real embedding of
graph(A), for any region database A, a database topologically elementary
equivalent to A. If A is not a region database, we cannot simply draw the children
as regions emanating from their parents, as done in the figure; now some children
have to be drawn as lines. The endpoints of these lines must be pairwise
connected; they are unwanted extra singular points and cannot be left “dangling.”

The problem, however, is that this seems impossible to do in FO[R].

Abstract. One of the most important advantages of constraint databases
is their ability to represent and to manipulate data in arbitrary dimension
within a uniform framework. Although the complexity of querying such
databases by standard means such as first-order queries has been shown to
be tractable for reasonable constraints (e.g. polynomial), it depends badly
(roughly speaking exponentially) upon the dimension of the data. A precise
analysis of the trade-off between the dimension of the input data and the
complexity of the queries reveals that the complexity strongly depends upon
the use the input makes of its dimensions. We introduce the concept of
orthographic dimension, which, for a convex object $O$, corresponds to the
dimension of the (component) objects $O_1, \ldots, O_n$ such that
$O = O_1 \times \cdots \times O_n$. We study properties of databases with
bounded orthographic dimension in a general setting of o-minimal structures,
and provide a syntactic characterization of first-order orthographic
dimension preserving queries.

The main result of the paper concerns linear constraint databases. We
prove that orthographic dimension preserving Boolean combinations of
conjunctive queries can be evaluated independently of the global dimension,
with operators limited to the orthographic dimension, in parallel on the
components. This results in an extremely efficient optimization mechanism,
which is very easy to use in practical applications.



1 Introduction 

The recent field of constraint databases, initiated at the beginning of the decade
[KKR90], has led to sound data models and query languages for multi-dimensional
data [PVV94, GST94, KG94, KPV95, GK97]. It allows one to represent infinite
relations of arbitrary dimension by quantifier-free formulae over some
arithmetical domain, and to manipulate these relations in a symbolic way. There
have been many theoretical studies on constraint databases, mostly focused on the
data model and on fundamental issues (expressive power and complexity)
pertaining to the associated query languages. More recently, prototypes have
started to emerge, showing the practical relevance of the constraint paradigm.
Target applications are mainly spatial and temporal databases, where the expected
improvement over traditional approaches follows from the formal framework which
provides sound foundations for data modeling and query language design.

the French CNRS GDR CASSINI. First author partly supported by IASI CNR in
Rome.

An important feature of constraint databases is their ability to handle in a 
uniform way pointsets in arbitrary dimension. The model is thus a natural candi- 
date for applications which manipulate 3d objects (CAD), spatio-temporal data, 
or more generally for any scientific domain handling high dimensional pointsets. 
Although the complexity of querying such databases by standard means such 
as first-order queries has been shown to be tractable for reasonable constraints 
(e.g. polynomial constraints), it depends badly (roughly speaking exponentially) 
upon the dimension of the data. This seems to put a severe restriction on the 
ability of the forthcoming constraint database systems to handle efficiently high- 
dimensional relations. 

In this paper we study techniques to overcome this problem which are based 
on restrictions of the geometry of the spatial objects allowed. They are express- 
ible as conditions on the formulae representing the pointsets. We investigate how 
the query evaluation process can take advantage of these restrictions by manip- 
ulating the objects through their projections on lower dimensional subspaces, 
with a complexity which depends only linearly upon the dimension. 

We first consider relations with loose orthographic dimension $\ell$, that is,
relations containing objects which are equal to the join of their projections on
subspaces of dimension $\ell$. They can be expressed by formulae over a first-order
language with constraints, such that each atomic subformula involves at most
$\ell$ variables. Unfortunately, this class is not closed under fundamental
operations such as projection.

We thus investigate a more drastic restriction, the (strict) orthographic
dimension, orthodim, of the data. A relation of dimension d has orthodim bounded
by $\ell$ if it can be represented by a formula with d variables, such that there
is a partition of the set of variables into (say k) components of at most $\ell$
variables, such that each constraint in the formula can only involve variables
from a single component. It is easy to see that a relation of orthodim bounded by
$\ell$ is a finite collection of objects O, such that
$O = \pi_{C_1} O \times \cdots \times \pi_{C_k} O$, where $\pi_{C_i} O$ is the
projection of O on the ith component.

We consider the problem of deciding if a constraint relation has a given 
orthodim, or more generally, if it can be rotated to satisfy some given orthodim. 
We study the decidability of these problems in terms of the context structure, 
and prove that both problems are tractable for linear constraints. 

We then consider first-order queries over databases of bounded orthodim.
We first show that although it is undecidable in general if a query preserves
the orthodim of its input, the class of preserving queries can be syntactically
characterized. The characterization is rather ad hoc; we were, however, able to
define a natural class of safe queries that preserve the orthodim.

Finally, we focus on linear constraint databases. We consider the symbolic
algebra, manipulating the generalized relations, and simulating operations of
classical relational algebra over infinite relations. A detailed study of the
complexity of the operations reveals that the evaluation of first-order queries
depends exponentially upon the global dimension of the input data.

The main result of the paper shows that for safe Boolean Combination of 
Conjunctive Queries (BCCQ), the complexity of the evaluation depends expo- 
nentially upon the orthodim, and only linearly upon the global dimension. 

The proof of this result is far from trivial. First note that among the oper- 
ations of relational algebra, selection is the only one that does not preserve the 
orthodim since it might introduce a dependency between variables of different 
components. All other operations preserve the orthodim. We define ALG^ as 
the set of queries which can be expressed with operations restricted to inputs 
of dimension at most the orthodim, including an approximated selection, which 
preserves the orthodim. We show that queries in ALG^ capture at least the full 
expressive power of safe BCCQ. 

The technique presented in this paper has been implemented in the dedale 
system IKlB.SHHbl which can handle objects of orthodim 2 of any dimension d. 

The paper is organized as follows. Section 2 introduces and studies the notion 
of orthographic dimension. Section 3 introduces a class of queries preserving the 
orthographic dimension of their inputs, and Section 4 focuses on the linear case. 

2 Orthographic Dimension 

We consider databases in the context of well-behaved infinite structures. We
assume some first-order language $\mathcal{L}$, consisting of two interpreted
predicate symbols, equality and order, and interpreted function and constant
symbols. In the sequel, we consider an arbitrary $\mathcal{L}$-structure
$\mathcal{A}$ with universe A. We also assume that the structure $\mathcal{A}$ is
o-minimal |V IVI IVIt)4j and admits quantifier elimination. A structure
$\mathcal{A}$ is o-minimal if every definable set $\{x \mid \varphi(x)\}$, with
$\varphi$ a first-order formula in $\mathcal{L}$, is a finite union of isolated
points and open intervals. $\mathcal{A}$ admits quantifier elimination if for every
first-order formula $\varphi(\bar{x})$, there exists an equivalent quantifier-free
formula $\psi(\bar{x})$ (i.e.
$\mathcal{A} \models \forall \bar{x}\, (\varphi(\bar{x}) \leftrightarrow \psi(\bar{x}))$).
Examples of structures of interest in the present context are: the rationals with
addition, $(\mathbb{Q}, \le, +, 0, 1)$, and the real polynomial arithmetic,
$(\mathbb{R}, \le, +, \times, 0, 1)$.

Both structures are o-minimal, admit quantifier elimination, and have decidable
first-order theories.

A (database) schema s is a finite set of relation symbols such that
$s \cap \mathcal{L} = \emptyset$. We always assume that the schema is disjoint from
the first-order language, and we distinguish between logical predicates (such as
$=$, $\le$) in $\mathcal{L}$, and relations in s.

We next define the finitely representable databases in the context of some
$\mathcal{L}$-structure $\mathcal{A}$. Kanellakis, Kuper and Revesz [KKR95]
introduced the concept of a d-ary generalized tuple, which is a conjunction of
atomic formulas in $\mathcal{L}$ with d variables. For instance in the context of
the real numbers, the expression $(x^2 + y^2 = 1) \wedge (x \le 0)$ is a binary
generalized tuple representing a half circle in the real plane. A d-ary
generalized relation is defined by a finite set of d-ary generalized tuples. In
this framework, a tuple [a, b] of the classical relational model [Ull88, Mai83]
is an abbreviation of the formula $(x = a \wedge y = b)$ involving only the
equality symbol and constants.



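Definition 1 Let $\varphi(\bar{x})$ be a formula in $\mathcal{L}$ with d distinct
variables $x_1, \ldots, x_d$. A d-ary relation $S \subseteq A^d$ is represented by
$\varphi$ over the $\mathcal{L}$-structure $\mathcal{A}$ if

$$\forall \bar{a} \in A^d, \quad \mathcal{A} \models \varphi(\bar{a}) \text{ iff } \bar{a} \in S.$$

S is a finitely representable relation represented by $\varphi$ over $\mathcal{A}$,
and $\varphi$ is a finite representation of S over $\mathcal{A}$. The attributes of
S are denoted by the corresponding variables of $\varphi$.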



Consider a d-ary relation R represented by a quantifier-free formula $\varphi$ in
disjunctive normal form:

$$\varphi = \bigvee_{i=1}^{n} \bigwedge_{j=1}^{\ell_i} \varphi_{i,j}$$

where the $\varphi_{i,j}$'s are atomic formulas in $\mathcal{L}$. Then, we also write
the representation $\varphi$ as a collection of generalized tuples $t_i$ in the set
notation:

$$\{\, t_i \mid 1 \le i \le n \,\}, \qquad t_i = \bigwedge_{j=1}^{\ell_i} \varphi_{i,j}.$$

A finitely representable database instance I over a schema s is a collection of
finitely representable relations, each associated to a relation name in the schema.
In the sequel, we often use the word "object" to denote finitely representable
sets of points not associated to a relation name.

Because the structure $\mathcal{A}$ admits quantifier elimination, a relation is
finitely representable iff it is finitely representable by a quantifier-free
formula. Therefore the restriction to quantifier-free formulae does not limit the
definable relations. The use of quantifier-free formulae in DNF to represent
relations, widely adopted in constraint databases, has a serious impact on the
data complexity, as was originally noticed in [KKR90]. In the present paper, we
consider, in addition, the impact of the restrictions on the use of variables in
the formulae on the complexity of query evaluation. Unless otherwise stated, we
assume that all relations are finitely representable.

We introduce a first restriction based on the number of distinct variables in 
each atomic formula, and corresponding to the classical notion of orthographic 
projections. 

Definition 2 A quantifier-free formula $\varphi(\bar{x})$ in $\mathcal{L}$ with d
distinct variables $x_1, \ldots, x_d$ has loose orthographic dimension $\ell$ if
each atom in $\varphi(\bar{x})$ involves at most $\ell$ distinct variables.

A relation S has loose orthographic dimension $\ell$ if there exists a
representation of S with a formula of loose orthographic dimension $\ell$. Note
that $\ell$ is not unique: if a relation has loose orthographic dimension $\ell$,
it also has loose orthographic dimension $\ell + 1$.

A convex d-dimensional object O with loose orthographic dimension $\ell$ can
be seen as the join of all its projections on the $\binom{d}{\ell}$
$\ell$-dimensional subspaces. For example, if O is a 3-dimensional object with
loose orthographic dimension 2, then
$O = \pi_{x,y} O \bowtie \pi_{x,z} O \bowtie \pi_{y,z} O$. That is, the object O can
be characterized by its orthographic projections. Figure 1a shows an example of
such an object.




Fig. 1. 3-d objects with bounded orthographic dimension 



Unfortunately, the class of relations with loose orthographic dimension $\ell$
is not closed under fundamental operations such as, for instance, projection.
Consider the relation S in $\mathbb{Q}^5$ defined by the formula
$x - y - z = 0 \wedge x - t - u = 0$. S has loose orthographic dimension 3. Its
projection on y, z, t, u is defined by the formula $y + z - t - u = 0$, which has
loose orthographic dimension 4, but is not equivalent to a formula with loose
orthographic dimension 3.
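
The projection claimed in this example can be checked mechanically. The following is a minimal sketch in Python, using exact rational arithmetic; the helper names `in_S` and `in_projection` are illustrative and do not come from the paper. It tests membership in the projection by solving for the only possible witness x, and compares the result with the atom y + z - t - u = 0 on a small grid of rationals.

```python
from fractions import Fraction
from itertools import product

def in_S(x, y, z, t, u):
    """Membership in S: x - y - z = 0 and x - t - u = 0."""
    return x - y - z == 0 and x - t - u == 0

def in_projection(y, z, t, u):
    """(y, z, t, u) is in the projection iff some x satisfies both equalities.
    The first equality forces x = y + z, so that is the only candidate."""
    return in_S(y + z, y, z, t, u)

# On a small rational grid, the projection coincides with y + z - t - u = 0,
# a single atom that involves four variables.
grid = [Fraction(n, 2) for n in range(-2, 3)]
agree = all(
    in_projection(y, z, t, u) == (y + z - t - u == 0)
    for y, z, t, u in product(grid, repeat=4)
)
```

The check only samples finitely many points, so it illustrates the example rather than proving the non-closure result.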



We therefore propose to consider a more drastic restriction which ensures 
better closure properties, and tighten the conditions on the use of the variables. 

We first define the notion of dependent variables with respect to a formula.
This notion was studied in |<!CKhblJ in the context of linear constraint databases.
Let $\varphi(x_1, \ldots, x_d)$ be a formula in $\mathcal{L}$ with d distinct free
variables $x_1, \ldots, x_d$. Two distinct variables which occur in the same atom
in $\varphi(x_1, \ldots, x_d)$ are said to be dependent in the same atom. The
dependency relation between variables in $\varphi$ is the reflexive symmetric
transitive closure of the dependency in the same atom relation defined above.
Variables which are not dependent are said to be independent. The orthographic
partition of the set of variables of a formula is the partition in equivalence
classes of the dependency relation. We can now introduce the concept of
orthographic dimension.

Definition 3 A quantifier-free formula $\varphi$ in $\mathcal{L}$ with d distinct
variables $x_1, \ldots, x_d$ has orthographic dimension (orthodim) $\ell$ if each
class of its orthographic partition has cardinality at most $\ell$.
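
The orthographic partition is the set of connected components of the "occur in the same atom" relation, so it can be computed with a union-find structure. The sketch below is in Python and assumes a simplified encoding that is not part of the paper: a formula is given only by the list of variable sets of its atoms. The function names are hypothetical.

```python
def orthographic_partition(variables, atoms):
    """variables: iterable of variable names.
    atoms: list of sets, each holding the variables of one atom.
    Returns the classes of the reflexive symmetric transitive closure
    of the 'occur in the same atom' relation."""
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for atom in atoms:
        members = list(atom)
        for other in members[1:]:
            parent[find(other)] = find(members[0])  # merge the two classes

    classes = {}
    for v in parent:
        classes.setdefault(find(v), set()).add(v)
    return sorted((frozenset(c) for c in classes.values()), key=sorted)

def orthodim(variables, atoms):
    """Size of the largest class of the orthographic partition."""
    return max(len(c) for c in orthographic_partition(variables, atoms))

def loose_orthodim(atoms):
    """Largest number of distinct variables in a single atom."""
    return max(len(a) for a in atoms)
```

For the formula x - y - z = 0 ∧ x - t - u = 0 used above, `loose_orthodim` gives 3 while `orthodim` gives 5, since x links all five variables into one class. These values describe the given formula only; the orthographic dimension of the relation may be smaller if an equivalent formula exists.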

It is now possible to define the orthographic dimension of a relation. 

Definition 4 A relation S is of orthographic dimension $\ell$ if there exists a
representation of S with a quantifier-free formula $\varphi$ of orthographic
dimension $\ell$.

Note that, as for the loose orthographic dimension, the orthographic dimension of
a relation is not uniquely defined. We do not consider the intrinsic unique (e.g.
minimal) orthographic dimension of relations, but merely, given $\ell$, the
relations that are of orthographic dimension $\ell$.

To a relation, we can associate a partition of its attributes as follows. 

Definition 5 A relation S admits an orthographic decomposition V if there
exists a representation of S with a formula $\varphi$ of orthographic partition V.
The subsets of the partition are called the components of the decomposition.

Note that the orthographic decomposition of a relation is not unique. Indeed,
a relation can be defined by different formulae with distinct orthographic
partitions. It can be shown, however, that the orthographic partitions of a set of
k variables form a lattice for the sub-partition relation. Nevertheless, we do not
consider a unique (e.g. finest) decomposition associated to a relation, but
conversely, given a fixed decomposition, the relations that admit this
decomposition.

A convex d-dimensional object O with orthographic decomposition V is equal
to the Cartesian product of all its projections on the components of V. Figure
1b shows an example of such an object.

Note that for $\ell = 1$, the orthographic dimension 1 coincides with the loose
orthographic dimension 1. In this case, we say that a relation is rectangular. It
admits a definition $\varphi$ using exclusively constraints of the form
$x \,\theta\, a$, where a is a constant and $\theta$ is a predicate.

We can now prove the following proposition which relates the orthographic 
decomposition to the rectangularity. 

Proposition 1 A relation R admits an orthographic decomposition
$V = (P_1, \ldots, P_k)$ iff for each pair of variables x, y such that $x \in P_i$
and $y \in P_j$, with $i \ne j$, and all interpretations $\theta$ of the attributes
of R distinct from x, y, the set
$S_\theta = \{x, y \mid R(\ldots, x, \ldots, y, \ldots)\theta\}$ is a rectangular
relation.

Therefore the orthographic decomposition can be reduced to the rectangularity
of the projections on planes of independent axes. In the context of o-minimal
structures, the boundary of a relation is definable in FO(<). A binary relation
is rectangular if each point in its boundary is either (i) an isolated point, (ii)
a point inside a vertical or horizontal segment, or (iii) a point on the corner of
a rectangle. Each of these properties can be expressed easily by a formula of
FO(<). The above development together with Proposition 1 leads to the following
proposition.

Proposition 2 There is an FO(<) formula $\mu_V$ such that for all instances I,
$\mu_V(I)$ is true iff I admits the orthographic decomposition V.

The relations of bounded orthographic dimension are very sensitive to various
transformations. We consider the closure of the class of relations of some
fixed orthographic dimension through algebraic operations such as set operations
(union, intersection, set difference), projection, selection, and transformations
such as translation, rotation, etc. The selection operator can relate by a single
constraint a group of independent variables, thus possibly modifying the
orthographic dimension of the input. For instance, if x and y are two independent
variables in a relation R, then in $\sigma_{x \le y}(R)$, x and y are in general
dependent.

Note that the class of relations of orthographic dimension $\ell$, in the context
of the real field, is closed under set operations, projection, translation, and
(axial) symmetries, but not under rotations.

Although rotations do not preserve the orthographic dimension, there are
relations originally not of orthodim $\ell$ which can be rotated adequately to
become of orthodim $\ell$, that is, such that there exists a vector basis in which
the relation has orthodim $\ell$.

We consider the general problem of deciding if a relation is of orthodim
$\ell$. Finitely representable relations are defined and manipulated by means of
quantifier-free formulae. In general these formulae need not be of orthodim $\ell$
even if the intended relation is. In such a case however, they are equivalent to
some formula of orthographic dimension $\ell$, defining the relation of
corresponding orthographic dimension. It is important to determine if this is the
case. We consider the two following problems.

1. Explicit orthographic dimension: Given a quantifier-free formula in
$\mathcal{L}$ with d distinct variables, is there an equivalent formula with
orthodim $\ell \le d$?

2. Implicit orthographic dimension: Given a quantifier-free formula in
$\mathcal{L}$ with d distinct variables, is there a linear transformation of the
system of coordinates in which an equivalent formula with orthodim $\ell \le d$
can be found?

The decidability of these two fundamental problems is characterized in the
sequel. The following result can be obtained using Proposition 2.

Theorem 3 Let $\mathcal{A}$ be an $\mathcal{L}$-structure which has a decidable
first-order theory. Then it is decidable if a quantifier-free formula in
$\mathcal{L}$ is equivalent to a formula $\psi$ with orthographic dimension $\ell$.

The following can be shown for the implicit orthographic dimension. 

Theorem 4 Let $\mathcal{A}$ be an enumerable $\mathcal{L}$-structure which has a
decidable first-order theory, and such that addition and multiplication are also
decidable. Then it is decidable if a quantifier-free formula in $\mathcal{L}$ is
equivalent, modulo a linear transformation of the system of coordinates, to a
formula $\psi$ with orthographic dimension $\ell$.

More precise results can be obtained for specific choices of the language
$\mathcal{L}$ and the structure $\mathcal{A}$. For instance in the case of linear
constraints over rational numbers, a polynomial time bound can be obtained for
both orthographic
dimension problems. The following result generalizes the result of EHM- 

Theorem 5 The problems of explicit and implicit orthographic dimension of a
linear constraint relation defined by a linear formula $\varphi$ can be solved in
polynomial time.

3 Query Languages 

We consider queries which can be expressed as first-order formulae in the
language $\mathcal{L}$ in the context of $\mathcal{A}$. If s is a schema, each
formula $\varphi$ in $\mathcal{L} \cup s$ with free variables $x_1, \ldots, x_n$
($n \ge 0$) defines a query over s in the context of $\mathcal{A}$, mapping
instances I of s to n-ary relations defined by
$\{(a_1, \ldots, a_n) \mid \mathcal{A} \cup I \models \varphi(a_1, \ldots, a_n)\}$.

Suppose $q = \{(x_1, \ldots, x_n) \mid \varphi\}$ is a query over s, and I is an
instance of s. Since each relation R in I can be defined by a quantifier-free
formula in $\mathcal{L}$, and $\varphi$ is a formula in $\mathcal{L} \cup s$, we can
replace in $\varphi$ each occurrence of a relation symbol $R \in s$ by a formula
defining R. The resulting formula $\varphi'$ is a formula in $\mathcal{L}$ with no
reference to relation symbols in s, which defines the answer to the query q on
I, denoted by q(I).

For complexity reasons, we consider as usual answers defined by a
quantifier-free formula $\psi$ in $\mathcal{L}$ such that $\varphi'$ and $\psi$ are
logically equivalent in $\mathcal{A}$. We assume in the sequel that the structure
$\mathcal{A}$ admits effective quantifier elimination.

As mentioned in the previous section, there are queries manipulating data in
dimension d with orthodim $\ell$ which do not have an output of orthodim $\ell$.
This is the case, for instance, of $\sigma_{x \le y}(R)$, where x and y are
variables from two different components of the orthographic decomposition. We
restrict our attention to queries preserving the orthographic decomposition.

Definition 6 Let s be a database schema, and V be an orthographic decomposition
of the relations in s. A query Q over s preserves the orthographic decomposition
V if for each instance I of orthographic decomposition V, Q(I) admits an
orthographic decomposition which refines V.

We consider also a generalization of the previous preservation property with
no reference to a specific decomposition. A query Q over s preserves the
orthographic decomposition if for each orthographic decomposition V and each
instance I of orthographic decomposition V, Q(I) admits an orthographic
decomposition which refines V. It is clear that if a query preserves the
orthographic decomposition, it also preserves a given orthographic decomposition V.

As for many other preservation properties of queries [VGV96, BL98], we prove
that the preservation of the orthographic decomposition is undecidable in an
o-minimal context structure, and for a schema with at least a binary relation
symbol.

Theorem 6 Let $\mathcal{A}$ be an o-minimal context $\mathcal{L}$-structure, and s
be a schema with a relation symbol of arity at least 2. Then it is undecidable if
a formula in $\mathcal{L} \cup s$ expresses a query over s which preserves the
orthographic decomposition.

Proof: The proof is done by a reduction from the problem of satisfaction of
a Boolean query, which was shown to be undecidable under the assumptions of
the theorem pson|. Consider a schema with a single relation R of arity n with
a non-trivial orthographic decomposition. The formula:

$$\bigwedge_{0 < i < j \le n} x_i \le x_j \;\wedge\; \varphi(R)$$

defines a query which preserves the orthographic decomposition of the input iff
$\varphi$ is unsatisfiable. □

If we restrict our attention to the class of conjunctive queries, then preserving 
the orthographic decomposition becomes decidable. 

Proposition 7 It is decidable whether a conjunctive query preserves the ortho- 
graphic decomposition. 

Proof: (sketch) Let q be a conjunctive query. It can be recursively changed into
an equivalent query of the form:

$$\pi_A \, \sigma_F (R_1 \times \cdots \times R_n)$$

where A is a set of attributes and F a conjunction of constraints.

Then it suffices to compute the non-rectangular connections between pairs
of variables from distinct components introduced by F. These are all pairs of
variables dependent in F, except those introducing a rectangular connection. The
connection between two variables x, y is said to be rectangular if the projection
over the plane defined by x, y of $\sigma_F (R_1 \times \cdots \times R_n)$ is a
rectangular relation. Since this property can be expressed in first-order logic
and the structure admits effective quantifier elimination, it is decidable.

We then check whether all connections between components added by F are
destroyed by the projection on A. If this is the case, the query clearly preserves
the orthographic decomposition. If this is not the case, it is possible to
construct an instance I such that q(I) has an orthographic decomposition which
does not refine the one of I. □

We now provide a syntactic characterization of the set of orthographic
decomposition preserving queries. Let $\mu_V(\varphi)$ be a Boolean formula (cf.
Proposition 2) which holds if $\varphi$ defines a relation that admits
decomposition V.

Let s be a schema, $V = (P_1, \ldots, P_n)$ an orthographic decomposition, and I
an instance of s, represented by a formula $\varphi_I$, which satisfies the
orthographic decomposition V. Consider a formula $\varphi(y_1, \ldots, y_k)$ in
$\mathcal{L} \cup s$, and assume wlog that
$\{y_1, \ldots, y_k\} \subseteq \{x_1, \ldots, x_d\}$. A formula of the form

$$\varphi(y_1, \ldots, y_k) \wedge (\mu_V(\varphi_I) \rightarrow \mu_V(\varphi))$$

is called decomposition V restricted.

The following theorem follows easily from Proposition 2.

Theorem 8 The class of orthographic decomposition preserving queries and
the class of decomposition restricted queries coincide.

The definition of decomposition restricted formulae is of little use in practice. 
It is not natural to write queries in this form. We therefore propose a second 
restriction which is intuitive and easy to check. 

In order to give the next syntactic restriction, we consider an algebra,
equivalent to first-order logic. The algebra consists of the following operations:
Cartesian product, $\times$, projection, $\pi$, union, $\cup$, set difference, $-$.
The algebra operations, whose effect is described below, are performed on sets of
generalized tuples. Let $R_1$ and $R_2$ be two relations, and respectively $e_1$ and
$e_2$ be sets of generalized tuples defining them.

1. $R_1 \times R_2 = \{t_1 \wedge t_2 \mid t_1 \in e_1, t_2 \in e_2\}$.

2. $\sigma_F(R_1) = \{t_1 \wedge F \mid t_1 \in e_1\}$ where F is any atom of $\mathcal{L}$.

3. $\pi_{\bar{x}} R_1$ is computed using the algorithm of quantifier elimination.¹

4. $R_1 \cup R_2 = e_1 \cup e_2$.

5. $R_1 - R_2 = \{t_1 \wedge t_2 \mid t_1 \in e_1, t_2 \in (e_2)^c\}$, where $e^c$ is the
set of tuples or disjuncts of a DNF formula corresponding to $\neg e$.

6. $simplify(R_1)$ eliminates redundancies and detects inconsistencies.
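
The purely symbolic operations (Cartesian product, selection, union) act only on the set structure of the representation. The sketch below, in Python, assumes an illustrative encoding that is not taken from the paper: a generalized tuple is a frozenset of atoms written as strings, and a relation is a set of such tuples. The function names are hypothetical, and no simplification or satisfiability check is performed.

```python
def cartesian(e1, e2):
    """R1 x R2: conjoin every tuple of e1 with every tuple of e2."""
    return {t1 | t2 for t1 in e1 for t2 in e2}

def select(e1, atom):
    """sigma_F(R1): add the atom F to every generalized tuple of e1."""
    return {t1 | {atom} for t1 in e1}

def union(e1, e2):
    """R1 U R2: union of the two sets of generalized tuples."""
    return e1 | e2

# R1 is the segment 0 <= x <= 1; R2 is the pair of points y = 2 and y = 3.
R1 = {frozenset({"0 <= x", "x <= 1"})}
R2 = {frozenset({"y = 2"}), frozenset({"y = 3"})}
prod = cartesian(R1, R2)
sel = select(prod, "x <= y")
```

Projection and set difference are deliberately left out of this sketch, since they require quantifier elimination and negation of DNF formulae respectively.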

The symbolic operations above are well defined in the sense that their effect 
on the intensional definition of sets corresponds exactly to the semantics of the 
corresponding relational operators from the relational algebra over the possibly 
infinite extension of the sets. The relational intersection and join are definable 
with the symbolic Cartesian product. The sets of variables are different in these 
three operations. For the Cartesian product, the variables of the two relations are 
disjoint, they are similar in the case of the intersection, and with a non empty in- 
tersection in the case of the join and the selection. The simplify operator, given a 
conjunction of constraints, eliminates redundancies, and detects inconsistencies. 
Simplification is necessary for checking the satisfiability of formulae in A. 
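
For linear constraints, projection can be computed by Fourier-Motzkin elimination. The following Python sketch handles one generalized tuple made of non-strict inequalities only; the encoding (a constraint is a pair of a coefficient dictionary and a bound, read as "sum of coefficient times variable is at most bound") and the name `fm_eliminate` are assumptions made for illustration, not the paper's implementation.

```python
from fractions import Fraction

def fm_eliminate(constraints, var):
    """Eliminate var from a conjunction of constraints.
    constraints: list of (coeffs, bound), each read as
    sum(coeffs[v] * v) <= bound. Returns constraints on the other variables."""
    upper, lower, rest = [], [], []
    for coeffs, bound in constraints:
        a = Fraction(coeffs.get(var, 0))
        if a > 0:
            upper.append((coeffs, bound, a))   # gives an upper bound on var
        elif a < 0:
            lower.append((coeffs, bound, a))   # gives a lower bound on var
        else:
            rest.append(({v: c for v, c in coeffs.items() if v != var}, bound))
    for cu, bu, au in upper:
        for cl, bl, al in lower:
            # Scale var's coefficient to +1 and -1, then add the two constraints.
            combined = {}
            for v in set(cu) | set(cl):
                if v == var:
                    continue
                c = Fraction(cu.get(v, 0)) / au + Fraction(cl.get(v, 0)) / (-al)
                if c != 0:
                    combined[v] = c
            rest.append((combined, Fraction(bu) / au + Fraction(bl) / (-al)))
    return rest

# 0 <= x <= 1 and y <= x; eliminating x should leave y <= 1.
tuple_xy = [({"x": 1}, 1), ({"x": -1}, 0), ({"y": 1, "x": -1}, 0)]
projected = fm_eliminate(tuple_xy, "x")
```

Each elimination step pairs every upper bound with every lower bound, so the number of constraints can grow quadratically per eliminated variable; the result may also contain trivial constraints, which is where a simplify step would be applied.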

All the logical operators, except selection, preserve the orthographic decom- 
position and can be carried out independently on each component of the input. 
For instance, the intersection of two objects can be processed by composing, 
at the tuple’s level, the intersection of the corresponding components. Union, 
set difference and simplification can also be carried out independently on each 
component. Cartesian product does not affect the components, and projection 
applies to the appropriate components. 

One way to be sure that a query preserves the orthographic dimension is
to forbid selections introducing a binding between components. But this would
drastically limit the expressive power of the language. Some queries preserving
the orthographic dimension might need intermediate results which have a
higher dimension. Indeed, queries like $\sigma_{x<y}(R)$, where x and y are
independent variables, would not be expressible.



¹ In the linear case this is done by the Fourier-Motzkin elimination method [Sch86].

We now introduce a class of restricted algebraic expressions preserving an
orthographic decomposition V which allows intermediate results of higher
dimension. Basically, it consists in checking that when a selection introduces a
binding between independent variables, this binding is further eliminated through
a projection.

Let E be an algebraic expression. We recursively define the collection of bad
bindings, as the set BB(E) of pairs of components which are linked at each step
of the computation process.

1. If E is a relation name or variable, $BB(E) = \emptyset$.

2. If $E = \sigma_F(E_1)$, then BB(E) is the union of $BB(E_1)$ with the collection of
pairs $\langle C_i, C_j \rangle$ such that x is a variable of $C_i$, y is a variable
of $C_j$ and x and y are bound in E, where $C_i$ and $C_j$ are distinct components
of V.

3. If $E = \pi_{x_{i_1}, \ldots, x_{i_k}}(E_1)$, then BB(E) is obtained by removing
from $BB(E_1)$ all pairs p such that one of the components in p is only defined
with variables that do not appear in $\{x_{i_1}, \ldots, x_{i_k}\}$.

4. If $E = E_1 \,\theta\, E_2$ with $\theta$ among $\{\cup, \cap, -, \times\}$, then
$BB(E) = BB(E_1) \cup BB(E_2)$.

Now if E is an algebraic expression such that $BB(E) = \emptyset$, E is said to be
bad-binding free.
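
The recursive definition of BB(E) translates directly into a structural recursion. The Python sketch below assumes a hypothetical encoding of algebraic expressions as nested tuples (`("rel", name)`, `("select", atom_vars, E1)`, `("project", kept_vars, E1)`, `("binop", op, E1, E2)`) and represents a pair of components as an unordered pair of component indices; neither choice comes from the paper.

```python
from itertools import combinations

def bad_bindings(expr, components):
    """components: list of sets of variable names (the decomposition V).
    Returns BB(expr) as a set of frozensets of component indices."""
    kind = expr[0]
    if kind == "rel":                      # rule 1: no bad binding
        return set()
    if kind == "select":                   # rule 2: add pairs linked by the atom
        _, atom_vars, sub = expr
        touched = [i for i, c in enumerate(components) if c & set(atom_vars)]
        new_pairs = {frozenset(p) for p in combinations(touched, 2)}
        return bad_bindings(sub, components) | new_pairs
    if kind == "project":                  # rule 3: drop pairs with a vanished component
        _, kept, sub = expr
        kept = set(kept)
        return {p for p in bad_bindings(sub, components)
                if all(components[i] & kept for i in p)}
    if kind == "binop":                    # rule 4: union of both sides
        _, _op, left, right = expr
        return bad_bindings(left, components) | bad_bindings(right, components)
    raise ValueError(f"unknown expression kind: {kind}")

V = [{"x"}, {"y"}, {"z"}]
R = ("rel", "R")
safe = ("project", ["x", "z"], ("select", {"x", "y"}, R))
unsafe = ("project", ["x", "y"], ("select", {"x", "y"}, R))
```

In `safe`, the binding between the components of x and y is introduced by the selection and then removed because y is projected out; in `unsafe` both variables are kept, so the bad binding remains.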

Definition 7 V-Safe queries are defined as the set of bad-binding free algebraic 
queries. 

The following is immediate : 

Proposition 9 Every V-safe query preserves the orthographic decomposition. 

The natural question is whether this restriction captures all the orthographic 
decomposition preserving queries. We can only give a partial answer to this 
question. 

Theorem 10 Every conjunctive query which preserves the orthographic decom- 
position V is equivalent to a V-safe query. 

Proof: (sketch) Let q be a conjunctive query preserving the orthographic
dimension. It can be recursively transformed into an equivalent query $\psi$ of
the form $\pi_A \sigma_F(R)$ where R is a cross product. If $BB(\psi) = \emptyset$
we are done. If this is not the case, it means that there are variables which are
connected by F but which have a rectangular connection. Let x and t be such
variables. In $\psi$ we replace equivalently the selection criterion F by F', which
is the formula $\exists t\, F \wedge \exists x\, F$ after quantifier elimination.
This can be done because F is a conjunction of constraints and represents a
convex set. By repeating this process we get an equivalent conjunctive query with
an empty BB. □

It is possible to extend the above theorem to the disjunction of conjunctive 
queries. But it is not clear how to extend it to more general classes of queries 
like Boolean combination of conjunctive queries. 

In the following section, we restrict our attention to a popular context struc- 
ture of practical interest, and show the importance of the orthographic dimension 
for query evaluation. 







4 Linear Constraint Databases 

This section is limited to the case where the context structure is $\mathcal{A} = (\mathbb{Q}, \le, +)$.
The finitely representable relations are called linear constraint relations. We 
study the impact of the various parameters which characterize a linear instance 
(such as the number of variables, the orthographic dimension, the number of 
constraints, etc.) on the complexity of queries, and show that the evaluation 
process can take advantage of the orthographic dimension of an input. We exhibit 
queries on d-dimensional databases which can be solved using only manipulation 
of $\ell$-dimensional pointsets.

We first consider the complexity of the algebraic operations. We consider an
algebra containing the operators $\times$, $\pi$, $\cup$, $-$, $\sigma$ and simplify presented above.

The DNF formulae representing pointsets are structured with the following 
constructs: (i) a top-level disjunction, (ii) conjunctions, (iii) predicates of the 
atomic constraints, and finally (iv) the parameters of the constraints. 

The algebraic operations apply to different levels of the formulae. $\cup$, $\times$, and
$\sigma_F$ apply at the level of the logical connectives and can be evaluated in a purely
symbolic way independently of the dimension. The other operations have an
effect on lower levels. Their complexity depends strongly upon the dimension of
the input.

- Set difference has an effect at all levels down to the predicates of the constraints,
which can be modified (e.g. < replaced by >). It can be computed by first
computing the cell decomposition of the space induced by all the constraints
that occur in both relations, and then checking which cells are in the result.
The time complexity of computing the cell decomposition is discussed in [GO97].

- Projection has an effect at all levels including the parameters, and implies
numerical computation. The Fourier-Motzkin algorithm [Sch86] can be used
to eliminate one variable of a convex set of $n$ facets in dimension $d$ in time
complexity $O(n^2)$. If $k$ variables need to be eliminated, the time complexity is
then $O(n^{2^k})$. A more subtle algorithm would be to simplify after eliminating
each variable, which reduces the output to a size linear in $n$. The overall
complexity to eliminate $k$ variables is then $O((n^2 + 2^d n^d) + \dots + (n^2 + 2^d n^d))$,
which is $O(k\, 2^d n^d)$.

- The complexity of simplification (and therefore satisfaction) of $n$ linear
constraints in dimension $d$ can be found in [Sch86, GO97]. It is essentially
exponential in the dimension.
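To make the quadratic growth of the projection step concrete, here is a minimal sketch of one Fourier-Motzkin elimination step. It assumes every constraint is written as a non-strict inequality `sum(coeffs[i] * x[i]) <= bound` (strict inequalities and redundancy removal are ignored), and the function names `eliminate` and `drop` are illustrative only.

```python
from fractions import Fraction
from itertools import product


def drop(row, k):
    """Return row without its k-th entry."""
    return row[:k] + row[k + 1:]


def eliminate(constraints, k):
    """Eliminate variable k from a list of (coeffs, bound) constraints,
    each meaning sum(coeffs[i] * x[i]) <= bound."""
    pos, neg, out = [], [], []
    for coeffs, bound in constraints:
        row = [Fraction(c) for c in coeffs]
        bound = Fraction(bound)
        if row[k] > 0:
            pos.append((row, bound))       # upper bound on x_k
        elif row[k] < 0:
            neg.append((row, bound))       # lower bound on x_k
        else:
            out.append((drop(row, k), bound))
    # Each (upper bound, lower bound) pair yields one new constraint,
    # hence the quadratic growth in the number of constraints.
    for (cp, bp), (cn, bn) in product(pos, neg):
        sp, sn = 1 / cp[k], -1 / cn[k]     # positive scaling factors
        combined = [sp * a + sn * b for a, b in zip(cp, cn)]
        out.append((drop(combined, k), sp * bp + sn * bn))
    return out
```

With `p` upper bounds and `q` lower bounds on the eliminated variable, the output contains `p * q` combined constraints plus the constraints that did not mention the variable, which is why simplifying after each step matters.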

Figure 2 summarizes the costs of the algebraic operators in dimension 1,
2, 3 and d with respect to the following parameters: n is the total number of 
constraints in the relations, t is the number of tuples, and k the number of 
variables projected out. All complexities are upper-bounds given modulo a coef- 
ficient factor. The blow up in complexity comes from projection, set difference, 
and simplification. 

In order to reduce the complexity, we consider input relations of orthodim
bounded by a given $\ell$. As mentioned earlier, queries over such relations can
be processed independently over each component except when a selection binds
two components. In this latter case, it seems necessary to use operators which
operate on pointsets whose dimension is higher than $\ell$.

Fig. 2. Complexity of the operators in the dimension

We introduce a new selection operator which preserves the orthographic
decomposition, and can be used in place of the classical selection. The new symbolic
selection, denoted by $\hat{\sigma}_F$, computes, for each generalized tuple $t$, the projection
$\pi_{C_i}(F \wedge t)$ for each component $C_i$ and returns
$\hat{\sigma}_F(t) = \pi_{C_1}(F \wedge t) \times \dots \times \pi_{C_n}(F \wedge t)$.
Note that $\hat{\sigma}_F(t)$ is not equal to $\sigma_F(t)$. It defines an approximation of $\sigma_F(t)$, such
that the two coincide on each component. We show how this new selection can
be used in place of the traditional one.

To catch the intuition, consider a simple conjunctive V-safe query
$q(R) = \pi_z(\sigma_F(R))$, in the context of a relation of dimension 3 and orthodim 2, with
decomposition $V = (\{x, y\}, \{z\})$. We focus on the evaluation of $q$ on a single
tuple, $O$ (see Figure 3a).





Fig. 3. Selection over a 3-d object of orthodim 2 



In Figure 3b, the selection $\sigma_F(O)$, with $F = x > z$, yields a polytope $O'$.
The exact value of $O'$ is not relevant for the final output of the query, which
is $\pi_z(O')$: $O'$ can be approximated by $\hat{\sigma}_F(O)$ (in dashed lines on Figure 3b).
Two comments are noteworthy. First, the result features a new constraint $z < a$
where $a$ is the highest coordinate of $O'$, $B.z$. Note that the point $B$ can be simply
computed using $F$ and the bounding-box of $O$. Second, a point in $\hat{\sigma}_F(O)$, $P$ for
instance, might not belong to $O'$, but its projections on $\{x, y\}$ and $\{z\}$ belong
to $\pi_{x,y}(O')$ and $\pi_z(O')$ respectively.

In Figure 3c, $F = x > z \wedge 2x + z < 2$ consists of two constraints. The same
technique applies: we approximate the result of $\sigma_F(O)$ by the cross-product of
its components. The new constraint in the final result is here $z < 2/3$, a constant
value which results from the intersection of the half-planes $x > z$ and $2x + z < 2$.
Note that it depends upon the query and not upon the database and can thus be
computed only once.
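The constant $2/3$ can be checked by eliminating $x$ from the two constraints by hand. The following short derivation is a worked illustration only; it rewrites the second constraint as an upper bound on $x$ and uses the fact that over the rationals a value of $x$ exists strictly between two bounds exactly when the lower bound is strictly smaller than the upper one.

```latex
\begin{align*}
F \;&\equiv\; (z < x) \,\wedge\, \left(x < \tfrac{2 - z}{2}\right) \\
\exists x\, F \;&\iff\; z < \tfrac{2 - z}{2}
  \;\iff\; 2z < 2 - z
  \;\iff\; 3z < 2
  \;\iff\; z < \tfrac{2}{3}
\end{align*}
```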

We now introduce $\mathrm{ALG}^{\ell}$ as the set of queries expressed with the usual algebra
where $\sigma$ is replaced by $\hat{\sigma}$.

Definition 8 $\mathrm{ALG}^{\ell}$ is the class of queries which can be expressed by the
following operations:

- $\cup$, $\times$

- $\pi^{\ell}$, $-^{\ell}$ and $\mathrm{simplify}^{\ell}$, which are the usual projection, negation, and
simplification but restricted to inputs of dimension at most $\ell$.

- $\hat{\sigma}_F$.

The main interest of $\mathrm{ALG}^{\ell}$ is that the complexity of evaluating queries is
linear in $d$ and exponential in $\ell$ only.

Lemma 11 Let $X$ be a class of instances of dimension $d$ with orthodim $\ell$. The
complexity of evaluating queries in $\mathrm{ALG}^{\ell}$ over $X$ is exponential in $\ell$, but only
linear in $d$.

Proof: As for the linearity in $d$, note that since all the operators used to
express queries in $\mathrm{ALG}^{\ell}$ preserve the orthographic dimension, each operator
can be independently applied in parallel to the components of its input. It remains
to prove that the complexity of each operator is at most exponential in $\ell$. Given
the complexities of Fig. 2 the result is immediate for all, except $\hat{\sigma}$. Indeed
it seems that the computation of $\pi_{C_i}(F \wedge t)$ for each component $C_i$ involves
an unrestricted projection $\pi^k$ with $\ell < k \le d$. We develop in the sequel an
evaluation which avoids the costly computation of $\pi^k$.

The first step consists in computing, for each tuple $t$, $mbb(t)$, the minimal
bounding-box of $t$. This can be obtained by applying $\mathrm{simplify}^{\ell}$ on the input of
$\hat{\sigma}$. Once $mbb(t)$ is known, $\hat{\sigma}_F(t)$ is computed by:

$E(t) = t \wedge \pi_{C_1}(F \wedge mbb(t)) \wedge \dots \wedge \pi_{C_n}(F \wedge mbb(t))$

where the $C_i$'s are the components. The operator $\pi$ is introduced in a logical
formula as an abbreviation. We show that $E(t)$ is indeed equivalent to $\hat{\sigma}_F(t)$.

1. ($\subseteq$) First, if $p \in E(t)$, then $p \in t$ and $p$ satisfies
$\pi_{C_1}(F \wedge mbb(t)) \wedge \dots \wedge \pi_{C_n}(F \wedge mbb(t))$.
Therefore each component $C_i$ in $p$ belongs to $\pi_{C_i}(F \wedge t \wedge mbb(t))$,
which is $\pi_{C_i}(\sigma_F(t))$.

2. ($\supseteq$) If now $p \in \hat{\sigma}_F(t) = \pi_{C_1}(F \wedge t) \times \dots \times \pi_{C_n}(F \wedge t)$, then $p \in t$ since by
hypothesis $t$ is equivalent to the cross product of the projections on its
components, and the projection of $p$ on each component satisfies the projection
of $F$. Hence $p$ satisfies $E(t)$, which is equivalent to $\hat{\sigma}_F(t)$.



Finally, we show that $E(t)$ can be computed in constant time for each tuple
as follows. Assume that
$F = \{a^j x + y \le a^j_0,\ j = 1, \dots, J\} \cup \{b^k x - y \le b^k_0,\ k = 1, \dots, K\}$,
where $x$ ranges over $\mathbb{Q}^{d-1}$ and $y$ ranges over $\mathbb{Q}$, and that $mbb(t)$ is
defined by the couples $(x^l_{min}, x^l_{max})$, $l = 1, \dots, d-1$, and $(y_{min}, y_{max})$. Then the
Fourier-Motzkin projection $\pi_x(F \wedge mbb(t))$ is:

$b^k x - b^k_0 \le a^j_0 - a^j x$    for $j = 1, \dots, J$, $k = 1, \dots, K$    (1)

$b^k x - b^k_0 \le y_{max}$    for $k = 1, \dots, K$    (2)

$a^j_0 - a^j x \ge y_{min}$    for $j = 1, \dots, J$    (3)

$x^l_{min} \le x^l \le x^l_{max}$    for $l = 1, \dots, d-1$    (4)



The set of constraints in (1) depends only upon $F$ and can thus be computed
only once. Constraints in (2), (3) and (4) depend upon the bounding box and
must therefore be computed for each tuple, but their number is linear in the
size of $F$. This means that, when evaluating $E(t)$, we can replace the costly
computation of $\pi$ by the computation of (2) and (3) in time $O(|F|)$.

The above algorithm illustrates the computation when a single variable is
projected out. The generalized algorithm for computing $\hat{\sigma}$ is given below.



Step 1. Project $F$, using the Fourier-Motzkin algorithm, on each component.
Note that this operation is done on the query and not on the database.
For each component $C_i$, this gives a set of constraints similar to those of
expression (1) above. See for instance the example of Figure 3c, where $z < 2/3$
is obtained by applying Fourier-Motzkin on $F = x > z \wedge 2x + z < 2$.

Step 2. For each tuple $t$:

(a) Simplify $t$ with the $\mathrm{simplify}^{\ell}$ operator. This yields as a side effect
$mbb(t)$.

(b) Compute expressions (2) and (3) above for every component $C_i$.
This can be done in time $O(|F|)$:

i. For each constraint cst in $F$, put cst in the form
$f(x^i_1, \dots, x^i_l) \;\theta\; g(x_1, \dots, x_m)$ where $f$ and $g$ are linear functions, the
$x^i$ are the variables of $C_i$ and $\theta \in \{\le, \ge\}$.

ii. If $\theta$ is $\ge$ (resp. $\le$), compute the local minimum (resp. maximum) $L$
of $g$ using the values of $mbb(t)$.

iii. Output the constraint $f(x^i_1, \dots, x^i_l) \;\theta\; L$.

(c) Finally, construct $E(t)$ using the conjunction of the constraints in (1),
(2), (3) and (4).

The operation $\hat{\sigma}$ can be computed with $\mathrm{simplify}^{\ell}$ and an $O(1)$ operation. This
concludes the proof of Lemma 11. □
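Step 2(b) amounts to replacing the side of a constraint that lies outside the component by its extreme value over the bounding box. The sketch below illustrates this under stated assumptions: constraints are non-strict, $g$ is given as a coefficient dictionary plus a constant, the bounding box maps each variable to a `(low, high)` pair, and the names `extreme` and `project_constraint` are hypothetical.

```python
from fractions import Fraction


def extreme(coeffs, const, box, want_max):
    """Minimum or maximum of const + sum(coeffs[v] * v) over an axis-aligned box.

    For a linear function the extreme value is reached at a corner, so each
    variable independently takes its low or high bound depending on the sign
    of its coefficient.
    """
    total = Fraction(const)
    for var, c in coeffs.items():
        lo, hi = box[var]
        c = Fraction(c)
        pick_hi = (c > 0) == want_max
        total += c * Fraction(hi if pick_hi else lo)
    return total


def project_constraint(f, theta, g_coeffs, g_const, box):
    """Turn `f theta g` into `f theta L`, where L bounds g over the box.

    `f` is kept as an opaque label for the part of the constraint that only
    mentions variables of the component being projected on.
    """
    if theta not in ("<=", ">="):
        raise ValueError(f"unsupported comparison: {theta}")
    # f <= g is satisfiable for some point of the box iff f <= max g;
    # f >= g is satisfiable for some point of the box iff f >= min g.
    bound = extreme(g_coeffs, g_const, box, want_max=(theta == "<="))
    return (f, theta, bound)
```

For example, projecting `z <= x` on the component `{z}` with `x` ranging over `[0, 1]` in the bounding box yields `z <= 1`, which mirrors how the bound on $z$ is read off the bounding box in Figure 3b.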

Intuitively, queries in $\mathrm{ALG}^{\ell}$ can be evaluated in parallel on each component.
It remains to characterize the class of queries over databases with orthodim $\ell$
which can be equivalently rewritten as queries in $\mathrm{ALG}^{\ell}$. Note first that all
first-order queries can be expressed in some $\mathrm{ALG}^{\ell}$. Indeed, it is easy to see that
FO $= \bigcup_{\ell} \mathrm{ALG}^{\ell}$. On the other hand, if V (and thus $\ell$) is fixed, it is conjectured that
not all first-order queries on databases of orthographic decomposition V can be
expressed in $\mathrm{ALG}^{\ell}$, even if the query is supposed to be V-safe.

Consider the relation $R(x, y, z)$ with orthographic partition $(\{x, y\}, \{z\})$, the
relation $S(y, z)$, and the V-safe query:

$q = \pi_y(A - S)$, where $A = \pi_{y,z}(\sigma_{x<z}(R))$.

The subquery $A$ is computed by $\pi_{y,z}(R \wedge x < z)$, which is in $\mathrm{ALG}^{3}$. This
suggests that the hierarchy of $\mathrm{ALG}^{\ell}$ is strict.

However, if we restrict our attention to V-safe Boolean combinations of
conjunctive queries (BCCQ) we can state the following fundamental result.

Theorem 12 Let $s$ be a database schema of orthographic decomposition V of
orthodim $\ell$. Let $q$ be a V-safe BCCQ over $s$. Then $q$ can be easily rewritten in
an equivalent query $q'$ in $\mathrm{ALG}^{\ell}$.

Proof: We first prove the result in the conjunctive case. Let $q$ be a V-safe
conjunctive query, where $V = \{C_1, \dots, C_n\}$ is the orthographic decomposition of
the schema $s$. As already mentioned, it can be put into the
form $\pi_A(\sigma_F(R_1 \times \dots \times R_l))$, where $A$ admits an orthographic decomposition
$\{C'_1, \dots, C'_m\}$ which is a refinement of V.

Lemma For each instance $I$ of $s$, $q(I) = \pi_A(\sigma_F(R_1 \times \dots \times R_l))$ is equivalent to
$q'(I) = \pi_A(\hat{\sigma}_F(R_1 \times \dots \times R_l))$.

Proof of Lemma: Let $t$ be a tuple in $q(I)$, and $t'$ be the tuple in $R_1 \times \dots \times R_l$
such that $t = \pi_A(\sigma_F(t'))$. Since $q$ is V-safe,

$t = \pi_{C'_1}(t) \times \dots \times \pi_{C'_m}(t) = \pi_{C'_1}(\sigma_F(t')) \times \dots \times \pi_{C'_m}(\sigma_F(t'))$

For each $C'_i \in A$, there exists some $C_{j_i} \in V$ such that (i) $C'_i \subseteq C_{j_i}$ and (ii)
$\forall k \ne i$, $C'_k \cap C_{j_i} = \emptyset$. Therefore we have:

$\pi_{C'_i}(t) = \pi_{C'_i}(\pi_{C_{j_i}}(t)) = \pi_A(\pi_{C_{j_i}}(t))$

It follows that $t = \pi_A(\pi_{C_{j_1}}(\sigma_F(t'))) \times \dots \times \pi_A(\pi_{C_{j_m}}(\sigma_F(t')))$, which can be
equivalently rewritten as
$\pi_A(\pi_{C_{j_1}}(\sigma_F(t')) \times \dots \times \pi_{C_{j_m}}(\sigma_F(t')))$ because the set $\{C_{j_i} \mid i \in \{1, \dots, m\}\}$ is an
orthographic partition. This proves that $t = \pi_A(\hat{\sigma}_F(t'))$, and thus that $q(I) = q'(I)$.
This concludes the proof of the Lemma. □

Now, if $q$ is a V-safe BCCQ, it consists of a Boolean combination of
conjunctive queries $\{q_1, \dots, q_n\}$. Since $q$ is V-safe, we have
$BB(q) = \emptyset = BB(q_1) \cup \dots \cup BB(q_n)$. Hence $BB(q_i) = \emptyset$, for $i \in \{1, \dots, n\}$. Each $q_i$ is safe, belongs to
$\mathrm{ALG}^{\ell}$, and yields relations with orthodim $\ell$. Finally $q$ itself involves Boolean
operations on relations with orthodim $\ell$, and belongs to $\mathrm{ALG}^{\ell}$. □

The new query $q'$ can be obtained easily. Indeed, if $q$ is a Boolean combination
of conjunctive queries of the form $\pi\sigma$, then $q'$ is obtained from $q$ by simply
replacing all the $\sigma$ by $\hat{\sigma}$. We can therefore conclude that there exists an
evaluation of V-safe BCCQs such that each operator manipulates only pointsets of
dimension $\ell$. These results can be summarized as follows.

Corollary 13 Let $s$ be a database schema of orthographic decomposition V.
The complexity of evaluating V-safe BCCQs over $s$ depends only linearly upon
the global dimension.

This follows directly from Lemma 11 and Theorem 12. It is open whether the
result extends to more general queries.

5 Conclusion 

We have introduced restrictions on the geometry of the objects contained in
multidimensional databases, which (i) can be easily characterized by syntactic
restrictions on the constraint formulae that represent the database, and (ii) 
ensure better performance for the evaluation of large classes of queries such 
as V-safe BCCQs. We have indeed demonstrated that the complexity of query 
evaluation depends linearly upon the global dimension. 

We have thus restricted both the class of databases, and the class of queries 
of interest. Although both classes are of great practical interest, we can try to 
weaken the restrictions. The poor closure properties of the class of inputs with
bounded loose orthographic dimension led us to consider strict orthographic
dimension instead. Nevertheless, this class deserves more study in the case of
relations of small dimension (e.g. $d \le 3$). It can be shown that several results of
the paper extend to the case of such relations.

The class of queries can also be extended in various directions. For instance, 
the subclass of V-safe queries with projection limited to projection on variables
of a same component also enjoys a similar property: they can be rewritten as
$\mathrm{ALG}^{\ell}$ queries.

A spatio-temporal application runs on the dedale system with objects of
orthodim 2. The restriction is transparent to the user, and the evaluation
of V-safe BCCQs relies on 2d operations only. This allows the manipulation
of multidimensional data at the cost of 2d data.

Abstract. We explore the effect of dimensionality on the “nearest neighbor”
problem. We show that under a broad set of conditions (much
broader than independent and identically distributed dimensions), as di- 
mensionality increases, the distance to the nearest data point approaches 
the distance to the farthest data point. To provide a practical perspec- 
tive, we present empirical results on both real and synthetic data sets 
that demonstrate that this effect can occur for as few as 10-15 dimen- 
sions. 

These results should not be interpreted to mean that high-dimensional 
indexing is never meaningful; we illustrate this point by identifying some 
high-dimensional workloads for which this effect does not occur. How- 
ever, our results do emphasize that the methodology used almost uni- 
versally in the database literature to evaluate high-dimensional indexing 
techniques is flawed, and should be modified. In particular, most such 
techniques proposed in the literature are not evaluated versus simple lin- 
ear scan, and are evaluated over workloads for which nearest neighbor 
is not meaningful. Often, even the reported experiments, when analyzed 
carefully, show that linear scan would outperform the techniques being 
proposed on the workloads studied in high (10-15) dimensionality! 



1 Introduction 



In recent years, many researchers have focused on finding efficient solutions
to the nearest neighbor (NN) problem, defined as follows: Given a collection
of data points and a query point in an m-dimensional metric space, find the
data point that is closest to the query point. Particular interest has centered
on solving this problem in high dimensional spaces, which arise from
techniques that approximate complex data, such as images, sequences, video,
and shapes, with long “feature” vectors. Similarity queries are performed
by taking a given complex object, approximating it with a high dimensional
vector to obtain the query point, and determining the data point closest to it in the
underlying feature space.

This paper makes the following three contributions: 

1) We show that under certain broad conditions (in terms of data and query dis- 
tributions, or workload), as dimensionality increases, the distance to the nearest 
neighbor approaches the distance to the farthest neighbor. In other words, the
contrast in distances to different data points becomes nonexistent. The conditions
we have identified in which this happens are much broader than the independent
and identically distributed (IID) dimensions assumption that other work
assumes. Our result characterizes the problem itself, rather than specific algo- 
rithms that address the problem. In addition, our observations apply equally to 
the k-nearest neighbor variant of the problem. When one combines this result 
with the observation that most applications of high dimensional NN are heuris- 
tics for similarity in some domain (e.g. color histograms for image similarity), 
serious questions are raised as to the validity of many mappings of similarity 
problems to high dimensional NN problems. This problem can be further exac- 
erbated by techniques that find approximate nearest neighbors, which are used 
in some cases to improve performance. 

2) To provide a practical perspective, we present empirical results based on syn- 
thetic distributions showing that the distinction between nearest and farthest 
neighbors may blur with as few as 15 dimensions. In addition, we performed 
experiments on data from a real image database that indicate that these
dimensionality effects occur in practice. Our observations suggest that
high-dimensional feature vector representations for multimedia similarity search must
be used with caution. In particular, one must check that the workload yields a 
clear separation between nearest and farthest neighbors for typical queries (e.g., 
through sampling). We also identify special workloads for which the concept of 
nearest neighbor continues to be meaningful in high dimensionality, to empha- 
size that our observations should not be misinterpreted as saying that NN in 
high dimensionality is never meaningful. 

3) We observe that the database literature on nearest neighbor processing tech- 
niques fails to compare new techniques to linear scans. Furthermore, we can 
infer from their data that a linear scan almost always out-performs their tech- 
niques in high dimensionality on the examined data sets. This is unsurprising 
as the workloads used to evaluate these techniques are in the class of “badly 
behaving” workloads identified by our results; the proposed methods may well 
be effective for appropriately chosen workloads, but this is not examined in their 
performance evaluation. 

In summary, our results suggest that more care be taken when thinking of 
nearest neighbor approaches and high dimensional indexing algorithms; we sup- 
plement our theoretical results with experimental data and a careful discussion. 



2 On the Significance of “Nearest Neighbor” 

The NN problem involves determining the point in a data set that is nearest 
to a given query point (see Figure 1). It is frequently used in Geographical
Information Systems (GIS), where points are associated with some geographical 
location (e.g., cities). A typical NN query is: “What city is closest to my current 
location?” 

While it is natural to ask for the nearest neighbor, there is not always a
meaningful answer. For instance, consider the scenario depicted in Figure 2.
Even though there is a well-defined nearest neighbor, the difference in distance
between the nearest neighbor and any other point in the data set is very small.
Since the difference in distance is so small, the utility of the answer in solving
concrete problems (e.g. minimizing travel cost) is very low. Furthermore, consider
the scenario where the position of each point is thought to lie in some circle with
high confidence (see Figure 3). Such a situation can come about either from
numerical error in calculating the location, or “heuristic error”, which derives
from the algorithm used to deduce the point (e.g. if a flat rather than a spherical
map were used to determine distance). In this scenario, the determination of a
nearest neighbor is impossible with any reasonable degree of confidence!

While the scenario depicted in Figure 2 is very contrived for a geographical
database (and for any practical two dimensional application of NN), we show
that it is the norm for a broad class of data distributions in high dimensionality.
To establish this, we will examine the number of points that fall into a query
sphere enlarged by some factor $\varepsilon$ (see Figure 4). If few points fall into this
enlarged sphere, it means that the data point nearest to the query point is
separated from the rest of the data in a meaningful way. On the other hand, if
many (let alone most!) data points fall into this enlarged sphere, differentiating
the “nearest neighbor” from these other data points is meaningless if $\varepsilon$ is small.
We use the notion instability for describing this phenomenon.

Fig. 3. The data points are approximations. Each circle denotes a region where
the true data point is supposed to be.

Fig. 4. Illustration of query region and enlarged region. (DMIN is the distance
to the nearest neighbor, and DMAX to the farthest data point.)

Definition 1 A nearest neighbor query is unstable for a given $\varepsilon$ if the distance
from the query point to most data points is less than $(1 + \varepsilon)$ times the distance
from the query point to its nearest neighbor.
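Definition 1 can be turned into a simple check on a list of query-to-point distances. The sketch below is illustrative only and rests on two assumptions that the definition leaves open: “most” is read as “more than half”, and the comparison is taken as non-strict. The function name `is_unstable` is hypothetical.

```python
def is_unstable(distances, eps):
    """Return True if the query is unstable for eps in the sense of Definition 1.

    `distances` holds the distance from the query point to every data point.
    Assumption: "most data points" is interpreted as more than half of them.
    """
    dmin = min(distances)                       # distance to the nearest neighbor
    within = sum(1 for d in distances if d <= (1.0 + eps) * dmin)
    return within > len(distances) / 2
```

Since the check only looks at distances, it makes no assumption about the metric that produced them.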

We show that in many situations, for any fixed $\varepsilon > 0$, as dimensionality rises,
the probability that a query is unstable converges to 1. Note that the points
that fall in the enlarged query region are the valid answers to the approximate
nearest neighbors problem.

3 NN in High-Dimensional Spaces 

This section contains our formulation of the problem, our formal analysis of the 
effect of dimensionality on the meaning of the result, and some formal
implications of the result that enhance understanding of our primary result.

3.1 Notational Conventions

We use the following notation in the rest of the paper: 

A vector: $x$

Probability of an event $e$: $P[e]$.

Expectation of a random variable $X$: $E[X]$.

Variance of a random variable $X$: $\mathrm{var}(X)$.

IID: Independent and identically distributed.
(This phrase is used with reference to the values assigned to a collection of
random variables.)

$X \sim F$: A random variable $X$ that takes on values following the distribution
$F$.

3.2 Some Results from Probability Theory 

Definition 2 A sequence of random vectors (all vectors have the same arity)
$A_1, A_2, \dots$ converges in probability to a constant vector $c$ if for all $\varepsilon > 0$
the probability of $A_m$ being at most $\varepsilon$ away from $c$ converges to 1 as $m \to \infty$.
In other words:

$\forall \varepsilon > 0,\ \lim_{m \to \infty} P[\|A_m - c\| \le \varepsilon] = 1$

We denote this property by $A_m \to_p c$. We also treat random variables that are
not vectors as vectors with arity 1.



Lemma 1 If $B_1, B_2, \dots$ is a sequence of random variables with finite variance
and $\lim_{m \to \infty} E[B_m] = b$ and $\lim_{m \to \infty} \mathrm{var}(B_m) = 0$ then $B_m \to_p b$.
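The text states Lemma 1 without proof. One standard way to see why it holds, given here only as a sketch, is to apply Markov's inequality to the squared deviation $(B_m - b)^2$ and split its expectation into a variance term and a bias term, both of which vanish by hypothesis.

```latex
\begin{align*}
P\big[\,|B_m - b| \ge \varepsilon\,\big]
  &\le \frac{E\big[(B_m - b)^2\big]}{\varepsilon^2}
   = \frac{\mathrm{var}(B_m) + \big(E[B_m] - b\big)^2}{\varepsilon^2}
   \;\xrightarrow[m \to \infty]{}\; 0 .
\end{align*}
```

Hence $P[|B_m - b| \le \varepsilon]$ tends to 1 for every fixed $\varepsilon > 0$, which is the convergence in probability of Definition 2.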

A version of Slutsky’s theorem Let $A_1, A_2, \dots$ be random variables (or
vectors) and $g$ be a continuous function. If $A_m \to_p c$ and $g(c)$ is finite then
$g(A_m) \to_p g(c)$.

Corollary 1 (to Slutsky’s theorem) If $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ are
sequences of random variables s.t. $X_m \to_p a$ and $Y_m \to_p b \ne 0$ then
$X_m / Y_m \to_p a / b$.

3.3 Nearest Neighbor Formulation 

Given a data set and a query point, we want to analyze how much the distance 
to the nearest neighbor differs from the distance to other data points. We do 
this by evaluating the number of points that are no farther away than a factor 
e larger than the distance between the query point and the NN, as illustrated 
in Figure 4. When examining this characteristic, we assume nothing about the
structure of the distance calculation. 

We will study this characteristic by examining the distribution of the
distance between query points and data points as some variable $m$ changes. Note
that eventually, we will interpret $m$ as dimensionality. However, nowhere in the
following proof do we rely on that interpretation. One can view the proof as 
a convergence condition on a series of distributions (which we happen to call 
distance distributions) that provides us with a tool to talk formally about the 
“dimensionality curse” . 

We now introduce several terms used in stating our result formally. 

Definition 3:

$m$ is the variable that our distance distributions may converge under ($m$ ranges
over all positive integers).

$F_{data_1}, F_{data_2}, \dots$ is a sequence of data distributions.

$F_{query_1}, F_{query_2}, \dots$ is a sequence of query distributions.

$n$ is the (fixed) number of samples (data points) from each distribution.

$\forall m$, $P_{m,1}, \dots, P_{m,n}$ are $n$ independent data points per $m$ such that
$P_{m,i} \sim F_{data_m}$.

$Q_m \sim F_{query_m}$ is a query point chosen independently from all $P_{m,i}$.

$0 < p < \infty$ is a constant.

$\forall m$, $d_m$ is a function that takes a data point from the domain of $F_{data_m}$ and a
query point from the domain of $F_{query_m}$ and returns a non-negative real number
as a result.

$DMIN_m = \min \{ d_m(P_{m,i}, Q_m) \mid 1 \le i \le n \}$.

$DMAX_m = \max \{ d_m(P_{m,i}, Q_m) \mid 1 \le i \le n \}$.



3.4 Instability Result 

Our main theoretical tool is presented below. In essence, it states that assuming 
the distance distribution behaves a certain way as m increases, the difference in 
distance between the query point and all data points becomes negligible (i.e., the 
query becomes unstable) . Future sections will show that the necessary behavior 
described in this section identifies a large class (larger than any other classes 
we are aware of for which the distance result is either known or can be readily 
inferred from known results) of workloads. More formally, we show: 



Theorem 1 Under the conditions in Definition 3, if

$\lim_{m \to \infty} \mathrm{var}\left( \frac{(d_m(P_{m,1}, Q_m))^p}{E[(d_m(P_{m,1}, Q_m))^p]} \right) = 0$    (1)

then for every $\varepsilon > 0$

$\lim_{m \to \infty} P[DMAX_m \le (1 + \varepsilon) DMIN_m] = 1$

Proof Let $\mu_m = E[(d_m(P_{m,i}, Q_m))^p]$. (Note that the value of this expectation
is independent of $i$ since all $P_{m,i}$ have the same distribution.)

Let $V_m = (d_m(P_{m,1}, Q_m))^p / \mu_m$.

Part 1: We’ll show that $V_m \to_p 1$.

It follows that $E[V_m] = 1$ (because $V_m$ is a random variable divided by its
expectation.)

Trivially, $\lim_{m \to \infty} E[V_m] = 1$.

The condition of the theorem (Equation 1) means that $\lim_{m \to \infty} \mathrm{var}(V_m) = 0$.
This, combined with $\lim_{m \to \infty} E[V_m] = 1$, enables us to use Lemma 1 to conclude
that $V_m \to_p 1$.

Part 2: We’ll show that if $V_m \to_p 1$ then

$\lim_{m \to \infty} P[DMAX_m \le (1 + \varepsilon) DMIN_m] = 1.$



Let $X_m = \left( (d_m(P_{m,1}, Q_m))^p / \mu_m, \dots, (d_m(P_{m,n}, Q_m))^p / \mu_m \right)$ (a vector of arity $n$).

Since each part of the vector $X_m$ has the same distribution as $V_m$, it follows
that $X_m \to_p (1, \dots, 1)$.

Since min and max are continuous functions we can conclude from Slutsky’s
theorem that $\min(X_m) \to_p \min(1, \dots, 1) = 1$, and similarly, $\max(X_m) \to_p 1$.
Using Corollary 1 on $\max(X_m)$ and $\min(X_m)$ we get

$\frac{\max(X_m)}{\min(X_m)} \to_p \frac{1}{1} = 1$

Note that $(DMIN_m)^p = \mu_m \min(X_m)$ and $(DMAX_m)^p = \mu_m \max(X_m)$. So,

$\left( \frac{DMAX_m}{DMIN_m} \right)^p = \frac{\mu_m \max(X_m)}{\mu_m \min(X_m)} = \frac{\max(X_m)}{\min(X_m)}$

Therefore, since $x \mapsto x^{1/p}$ is continuous, Slutsky’s theorem gives

$\frac{DMAX_m}{DMIN_m} \to_p 1$

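By definition of convergence in probability we have that for all $\varepsilon > 0$,

$\lim_{m \to \infty} P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right] = 1$

Also,

$P[DMAX_m \le (1 + \varepsilon) DMIN_m] = P\left[ \frac{DMAX_m}{DMIN_m} - 1 \le \varepsilon \right] = P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right]$

($P[DMAX_m \ge DMIN_m] = 1$ so the absolute value in the last term has no effect.)
Thus,

$\lim_{m \to \infty} P[DMAX_m \le (1 + \varepsilon) DMIN_m] = \lim_{m \to \infty} P\left[ \left| \frac{DMAX_m}{DMIN_m} - 1 \right| \le \varepsilon \right] = 1$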



In summary, the above theorem says that if the precondition holds (i.e., if the
distance distribution behaves a certain way as $m$ increases), all points converge
to the same distance from the query point. Thus, under these conditions, the
concept of nearest neighbor is no longer meaningful.

We may be able to use this result by directly showing that $V_m \to_p 1$ and using
part 2 of the proof. (For example, for IID distributions, $V_m \to_p 1$ follows readily
from the Weak Law of Large Numbers.) Later sections demonstrate that our
result provides us with a handy tool for discussing scenarios resistant to analysis
using law of large numbers arguments. From a more practical standpoint, there
are two issues that must be addressed to determine the theorem’s impact:

- How restrictive is the condition

$\lim_{m \to \infty} \mathrm{var}\left( \frac{(d_m(P_{m,1}, Q_m))^p}{E[(d_m(P_{m,1}, Q_m))^p]} \right) = \lim_{m \to \infty} \frac{\mathrm{var}\left( (d_m(P_{m,1}, Q_m))^p \right)}{\left( E[(d_m(P_{m,1}, Q_m))^p] \right)^2} = 0$    (2)

which is necessary for our results to hold? In other words, it says that as
we increase $m$ and examine the resulting distribution of distances between
queries and data, the variance of the distance distribution scaled by the
overall magnitude of the distance converges to 0. To provide a better
understanding of the restrictiveness of this condition, Section 3.5 and later
sections discuss scenarios that do and do not satisfy it.

- For situations in which the condition is satisfied, at what rate do distances
between points become indistinct as dimensionality increases? In other words,
at what dimensionality does the concept of “nearest neighbor” become
meaningless? This issue is more difficult to tackle analytically. We therefore
performed a set of simulations that examine the relationship between $m$ and the
ratio of minimum and maximum distances with respect to the query point.
The results of these simulations are presented later in the paper.
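A small experiment of the kind described in the second item can be sketched as follows. This is not the authors' simulation: it assumes data and query coordinates drawn IID from the uniform distribution on [0, 1], an $L_p$ distance, and a fixed random seed, and the function name `dmax_over_dmin` is hypothetical.

```python
import random


def dmax_over_dmin(m, n, p=2.0, seed=0):
    """Ratio DMAX_m / DMIN_m for n uniform random points in [0, 1]^m.

    The query point is drawn from the same distribution, independently of
    the data points, and distances are measured with the L_p metric.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(m)]
    dists = []
    for _ in range(n):
        point = [rng.random() for _ in range(m)]
        dists.append(sum(abs(a - b) ** p for a, b in zip(point, query)) ** (1.0 / p))
    return max(dists) / min(dists)


# Print the ratio for a few dimensionalities to observe the trend.
for m in (2, 10, 100, 1000):
    print(m, round(dmax_over_dmin(m, n=200), 3))
```

Under these assumptions the ratio is always at least 1, and it is expected to shrink toward 1 as `m` grows, which is the loss of contrast that Theorem 1 predicts for workloads satisfying its condition.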



3.5 Application of Our Theoretical Result 

This section analyses the applicability of Theorem 1 in formally defined
situations. This is done by determining, for each scenario, whether the condition
in Equation 2 is satisfied. Due to space considerations, we do not give a proof
of whether the condition in Equation 2 is satisfied or not. A full analysis of each
example is given in a separate reference.

All of these scenarios define a workload and use an Lp distance metric over 
multidimensional query and data points with dimensionality m. (This makes the 
data and query points vectors with arity m.) It is important to notice that this is 
the first section to assign a particular meaning to dm (as an Lp distance metric), 
p (as the parameter to Lp), and m (as dimensionality). Theorem 1 did not make
use of these particular meanings. 

We explore some scenarios that satisfy Equation 2 and some that do not. We
start with basic IID assumptions and then relax these assumptions in various
ways. We start with two “sanity checks”: we show that distances converge with
IID dimensions (Example 1), and we show that Equation 2 is not satisfied when
the data and queries fall on a line (Example 2). We then discuss examples
involving correlated attributes and differing variance between dimensions, to
illustrate scenarios where the Weak Law of Large Numbers cannot be applied
(Examples 3, 4 and 5).

Example 1 IID Dimensions with Query and Data Independence. 

Assume the following: 

— The data distribution and query distribution are IID in all dimensions. 

— All the appropriate moments are finite (i.e., up to the $\lceil 2p \rceil$'th moment).

— The query point is chosen independently of the data points. 

The conditions of Theorem 1 are satisfied under these assumptions. While this
result is not original, it is a nice “sanity check.” (In this very special case we
can prove Part 1 of Theorem 1 by using the Weak Law of Large Numbers. However,
this is not true in general.) The assumptions of this example are by no
means necessary for Theorem 1 to be applicable. Throughout this section, there
are examples of workloads which cannot be discussed using the Weak Law of
Large Numbers. While there are innumerable slightly stronger versions of the
Weak Law of Large Numbers, Example 5 contains an example which meets our
condition, and for which the Weak Law of Large Numbers is inapplicable.
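
As a worked instance of this sanity check, assume (for illustration only; the example does not require it) that every coordinate is uniform(0,1) and that p = 2. Writing $D_i$ for the difference between the i-th coordinates of a data point and the query point (notation introduced only for this sketch), the relative variance of the squared distance can be computed in closed form:

```latex
% Assumptions (ours): P and Q have IID Uniform(0,1) coordinates, independent of each other, p = 2.
\begin{align*}
E[D_i^2] &= \tfrac{1}{6}, \qquad E[D_i^4] = 2\int_0^1 x^4 (1 - x)\,dx = \tfrac{1}{15}, \qquad
\operatorname{var}(D_i^2) = \tfrac{1}{15} - \tfrac{1}{36} = \tfrac{7}{180},\\
\frac{\operatorname{var}\bigl(\sum_{i=1}^{m} D_i^2\bigr)}{\bigl(E\bigl[\sum_{i=1}^{m} D_i^2\bigr]\bigr)^2}
  &= \frac{m \cdot 7/180}{(m/6)^2} = \frac{7}{5m} \longrightarrow 0 \quad (m \to \infty).
\end{align*}
```

In this special case the relative variance decays like 1/m, so the condition holds.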

Example 2 Identical Dimensions with no Independence. 

We use the same notation as in the previous example. In contrast to the previous 
case, consider the situation where all dimensions of both the query point and 
the data points follow identical distributions, but are completely dependent (i.e., 
value for dimension 1 = value for dimension 2 = . . .). Conceptually, the result 
is a set of data points and a query point on a diagonal line. No matter how 
many dimensions are added, the underlying query can actually be converted to 
a one-dimensional nearest neighbor problem. It is not surprising to find that the 
condition of Theorem 1 is not satisfied.
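
To see why the condition fails here, note that when all m coordinates are equal, the p-th power of the Lp distance is m copies of a single term. A short derivation, writing $D$ for the common coordinate difference between a data point and the query point (notation introduced only for this sketch):

```latex
\begin{align*}
(d_m(P_{m,1},Q_m))^p &= \sum_{i=1}^{m} |D|^p = m\,|D|^p,\\
\frac{\operatorname{var}\bigl(m\,|D|^p\bigr)}{\bigl(E[m\,|D|^p]\bigr)^2}
  &= \frac{m^2 \operatorname{var}(|D|^p)}{m^2 \bigl(E[|D|^p]\bigr)^2}
   = \frac{\operatorname{var}(|D|^p)}{\bigl(E[|D|^p]\bigr)^2}.
\end{align*}
```

The ratio does not depend on m, so it cannot converge to 0 unless $|D|$ is almost surely constant.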

Example 3 Unique Dimensions with Correlation Between All Dimensions. 

In this example, we intentionally break many assumptions underlying the IID 
case. Not only is every dimension unique, but all dimensions are correlated with 
all other dimensions and the variance of each additional dimension increases. 
The following is a description of the problem. 

We generate an m dimensional data point (or query point) $X_m = (X_1, \ldots, X_m)$
as follows:

— First we take independent random variables
$U_1, \ldots, U_m$ such that $U_i \sim \mathrm{Uniform}(0, \sqrt{i})$.

— We define $X_1 = U_1$.

— For all $2 \le i \le m$ define $X_i = U_i + (X_{i-1}/2)$.

The condition of Theorem 1 is satisfied.
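
The generation procedure above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the query point is produced by the same process, independently of the data point, and uses p = 2 by default. The function names are hypothetical.

```python
import math
import random

def recursive_point(m, rng):
    """Draw X_1..X_m with U_i ~ Uniform(0, sqrt(i)), X_1 = U_1, X_i = U_i + X_{i-1}/2."""
    point, prev = [], 0.0
    for i in range(1, m + 1):
        u = rng.uniform(0.0, math.sqrt(i))
        x = u if i == 1 else u + prev / 2.0
        point.append(x)
        prev = x
    return point

def relative_variance(m, p=2, trials=2000, seed=0):
    """Monte Carlo estimate of var(d^p) / E[d^p]^2 for this workload."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        # Assumption: data and query points are independent draws of the same process.
        data, query = recursive_point(m, rng), recursive_point(m, rng)
        vals.append(sum(abs(a - b) ** p for a, b in zip(data, query)))
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / (trials - 1)
    return var / mean ** 2
```

Comparing `relative_variance(4)` with `relative_variance(64)` gives a quick empirical check that the ratio shrinks despite the correlation and the growing variances.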

Example 4 Variance Converging to 0.

This example illustrates that there are workloads that meet the preconditions of
Theorem 1 even though the variance of the distance in each added dimension
converges to 0. One would expect that only some finite number of the earlier
dimensions would dominate the distance. Again, this is not the case.

Suppose we choose a point $X_m = (X_1, \ldots, X_m)$ such that the $X_i$'s are
independent and $X_i \sim N(0, 1/i)$. Then the condition of Theorem 1 is satisfied.
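
A short calculation makes this plausible. It is a sketch under assumptions that the example leaves open: p = 2, and the data point and query point are independent draws of $X_m$. Here $D_i$ denotes the difference of their i-th coordinates (notation introduced only for this sketch), and we use the fact that $\operatorname{var}(Z^2) = 2\sigma^4$ for $Z \sim N(0, \sigma^2)$.

```latex
% Assumptions (ours): p = 2; data and query points are independent draws of X_m.
\begin{align*}
D_i &\sim N(0,\, 2/i), \qquad E[D_i^2] = \frac{2}{i}, \qquad
\operatorname{var}(D_i^2) = 2\left(\frac{2}{i}\right)^2 = \frac{8}{i^2},\\
\frac{\operatorname{var}\bigl(\sum_{i=1}^{m} D_i^2\bigr)}{\bigl(E\bigl[\sum_{i=1}^{m} D_i^2\bigr]\bigr)^2}
  &= \frac{8\sum_{i=1}^{m} i^{-2}}{\bigl(2\sum_{i=1}^{m} i^{-1}\bigr)^2}
  \le \frac{\pi^2/3}{\bigl(\sum_{i=1}^{m} i^{-1}\bigr)^2} \longrightarrow 0 \quad (m \to \infty).
\end{align*}
```

The numerator stays bounded while the harmonic sum in the denominator diverges, so the ratio tends to 0, although only at a rate of about $1/(\ln m)^2$.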

Example 5 Marginal Data and Query Distributions Change with Dimension- 
ality. 

In this example, the marginal distributions of data and queries change with di- 
mensionality. Thus, the distance distribution as dimensionality increases cannot 
be described as the distance in a lower dimensionality plus some new component 
from the new dimension. As a result, the weak law of large numbers, which im- 
plicitly is about sums of increasing size, cannot provide insight into the behavior 
of this scenario. The distance distributions must be treated, as our technique 
suggests, as a series of random variables whose variance and expectation can be 
calculated and examined in terms of dimensionality. 

Let the m dimensional data space $S_m$ be the boundary of an m dimensional
unit hyper-cube (i.e., $S_m = [0,1]^m - (0,1)^m$). In addition, let the distribution
of data points be uniform over $S_m$. In other words, every point in $S_m$ has equal
probability of being sampled as a data point. Lastly, the distribution of query
points is identical to the distribution of data points.

Note that the dimensions are not independent. Even in this case, the condition
of Theorem 1 is satisfied.
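
One way to sample this workload is sketched below. It is an illustration, not the authors' procedure: it relies on the observation that all 2m faces of the unit cube have the same (m-1)-dimensional volume, so choosing a face uniformly and then a uniform point on it yields the uniform distribution over the boundary. The function names, p = 2, and the independence of query and data points are assumptions.

```python
import random

def sample_cube_boundary(m, rng):
    """Uniform sample from [0,1]^m minus (0,1)^m, the boundary of the unit cube."""
    point = [rng.random() for _ in range(m)]
    # Pick one of the 2m equal-area faces: an axis and a side (0 or 1).
    axis = rng.randrange(m)
    point[axis] = float(rng.randrange(2))
    return point

def relative_variance(m, p=2, trials=2000, seed=0):
    """Monte Carlo estimate of var(d^p) / E[d^p]^2 for boundary-uniform points."""
    rng = random.Random(seed)
    vals = []
    for _ in range(trials):
        data, query = sample_cube_boundary(m, rng), sample_cube_boundary(m, rng)
        vals.append(sum(abs(a - b) ** p for a, b in zip(data, query)))
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / (trials - 1)
    return var / mean ** 2
```

Note that the marginal distribution of each coordinate depends on m, because the probability that a given coordinate is the pinned one is 1/m.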

4 Meaningful Applications of High Dimensional NN 

In this section, we place Theorem 1 in perspective, and observe that it should
not be interpreted to mean that high-dimensional NN is never meaningful. We 
do this by identifying scenarios that arise in practice and that are likely to have 
good separation between nearest and farthest neighbors. 

4.1 Classification and Approximate Matching 

To begin with, exact match and approximate match queries can be reasonable. 
For instance, if there is dependence between the query point and the data points 
such that there exists some data point that matches the query point exactly, then 
DMINm = 0. Thus, assuming that most of the data points aren’t duplicates, a 
meaningful answer can be determined. Furthermore, if the problem statement is 
relaxed to require that the query point be within some small distance of a data 
point (instead of being required to be identical to a data point), we can still call 
the query meaningful. Note, however, that staying within the same small distance 
becomes more and more difficult as m increases since we are adding terms to the
sum in the distance metric. For this version of the problem to remain meaningful
as dimensionality increases, the query point must be increasingly closer to some 
data point. 

We can generalize the situation further as follows: The data consists of a set of
randomly chosen points together with additional points distributed in clusters
of some radius $\delta$ around one or more of the original points, and the query is
required to fall within one of the data clusters (see Figure 5). This situation
is the perfectly realized classification problem, where data naturally falls into
discrete classes or clusters in some potentially high dimensional feature space.
Figure 6 depicts a typical distance distribution in such a scenario. There is a
cluster (the one into which the query point falls) that is closer than the others,
which are all, more or less, indistinguishable in distance. Indeed, the proper
response to such a query is to return all points within the closest cluster, not
just the nearest point (which quickly becomes meaningless compared to other
points in the cluster as dimensionality increases).

Observe however, that if we don’t guarantee that the query point falls within 
some cluster, then the cluster from which the nearest neighbor is chosen is subject 
to the same meaningfulness limitations as the choice of nearest neighbor in the 
original version of the problem; Theorem 1 then applies to the choice of the
“nearest cluster” . 
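
The clustered scenario can be illustrated with a small simulation. The sketch below is not from the paper and simplifies the setting: cluster centers are uniform in the unit cube, each cluster is an axis-aligned box of half-width `delta` around its center (rather than a ball of radius delta), and the query is drawn from the first cluster. All names and parameter values are hypothetical.

```python
import math
import random

def clustered_contrast(m, n_clusters=20, per_cluster=20, delta=0.01, seed=0):
    """Ratio of the nearest out-of-cluster distance to the farthest in-cluster distance."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(m)] for _ in range(n_clusters)]

    def near(center):
        # A point inside the box of half-width delta around the given center.
        return [c + rng.uniform(-delta, delta) for c in center]

    query = near(centers[0])  # the query is guaranteed to fall in cluster 0
    own = [math.dist(query, near(centers[0])) for _ in range(per_cluster)]
    others = [math.dist(query, near(c)) for c in centers[1:] for _ in range(per_cluster)]
    return min(others) / max(own)

print(clustered_contrast(100))
```

A ratio well above 1 at m = 100 indicates that the query's own cluster remains clearly separated from all other clusters, even though the points inside that cluster are nearly indistinguishable from one another.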



4.2 Implicitly Low Dimensionality 

Another possible scenario where high dimensional nearest neighbor queries are 
meaningful occurs when the underlying dimensionality of the data is much lower 
than the actual dimensionality. There has been recent work on identifying these 
situations (e.g. H7TO) and determining the useful dimensions (e.g. IZDI, which 
uses principal component analysis to identify meaningful dimensions) . Of course, 
these techniques are only useful if NN in the underlying dimensionality is
meaningful.

Fig. 6. Probability density function of distance between random clustered data
and query points. (Horizontal axis: distance between two random points.)



5 Experimental Studies of NN 

Theorem 1 only tells us what happens when we take the dimensionality to infinity.
In practice, at what dimensionality do we anticipate nearest neighbors to
become unstable? In other words, Theorem 1 describes some convergence but
does not tell us the rate of convergence. We addressed this issue through em- 
pirical studies. Due to lack of space, we present only three synthetic workloads 
and one real data set. m includes additional synthetic workloads along with 
workloads over a second real data set. 

We ran experiments with one IID uniform(0,1) workload and two different
correlated workloads. Figure 7 shows the average $DMAX_m/DMIN_m$ over 1000
query points as dimensionality increases, on synthetic data sets of one million
tuples. The workload for the “recursive” line (described in Example 3) has
correlation between every pair of dimensions and every new dimension has a
larger variance. The “two degrees of freedom” workload generates query and
data points on a two dimensional plane, and was generated as follows:

— Let $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$ be constants in (-1,1).

— Let $U_1, U_2$ be independent uniform(0,1).

— For all $1 \le i \le m$ let $X_i = a_i U_1 + b_i U_2$.

This last workload does not satisfy Equation 2. Figure 7 shows that the “two
degrees of freedom” workload behaves similarly to the (one or) two dimensional 
uniform workload, regardless of the dimensionality. However, the recursive work- 
load (as predicted by our theorem) was affected by dimensionality. More in- 
terestingly, even with all the correlation and changing variances, the recursive 
workload behaved almost the same as the IID uniform case! 
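
The “two degrees of freedom” workload is easy to reproduce in miniature. The sketch below is an illustration under assumptions: the text does not say how the constants were chosen, so they are drawn here once, uniformly from the interval between -1 and 1; the query is generated like a data point; and p = 2. The function names are hypothetical.

```python
import random

def make_coefficients(m, rng):
    """Constants a_i, b_i between -1 and 1 (drawing them uniformly is our assumption)."""
    a = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    return a, b

def two_dof_point(a, b, rng):
    """One point X with X_i = a_i * U_1 + b_i * U_2, where U_1, U_2 ~ uniform(0,1)."""
    u1, u2 = rng.random(), rng.random()
    return [ai * u1 + bi * u2 for ai, bi in zip(a, b)]

def relative_variance(m, trials=2000, seed=0):
    """Monte Carlo estimate of var(d^2) / E[d^2]^2 for this workload."""
    rng = random.Random(seed)
    a, b = make_coefficients(m, rng)
    vals = []
    for _ in range(trials):
        data, query = two_dof_point(a, b, rng), two_dof_point(a, b, rng)
        vals.append(sum((x - y) ** 2 for x, y in zip(data, query)))
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / (trials - 1)
    return var / mean ** 2

for m in (10, 100):
    print(m, round(relative_variance(m), 3))
```

Unlike the IID case, the estimate does not shrink as m grows, because every point is determined by the same two random numbers regardless of the dimensionality.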

Fig. 7. Correlated distributions, one million tuples.

This graph demonstrates that our geometric intuition for nearest neighbor,
which is based on one, two, and three dimensions, fails us at an alarming rate as
dimensionality increases. The distinction between nearest and farthest points,
even at ten dimensions, is a tiny fraction of what it is in one, two, or three 
dimensions. For one dimension, $DMAX_m/DMIN_m$ for “uniform” is on the order
of 10^, providing plenty of contrast between the nearest object and the farthest 
object. At 10 dimensions, this contrast is already reduced by 6 orders of mag- 
nitude! By 20 dimensions, the farthest point is only 4 times the distance to the 
closest point. These empirical results suggest that NN can become unstable with 
as few as 10-20 dimensions. 
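
The loss of contrast can be observed on a much smaller scale than the experiments reported here. The following sketch is an illustration only: it uses a single query and 1000 uniform(0,1) data points instead of 1000 queries over one million tuples, and the Euclidean metric. The name `max_min_ratio` is hypothetical, and the absolute values it produces are not comparable to those in the figure.

```python
import math
import random

def max_min_ratio(m, n_points=1000, seed=0):
    """DMAX/DMIN for one uniform query against n_points uniform data points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(m)]
    dists = [math.dist(query, [rng.random() for _ in range(m)])
             for _ in range(n_points)]
    return max(dists) / min(dists)

for m in (1, 2, 5, 10, 20):
    print(m, round(max_min_ratio(m), 2))
```

Even with this small data set the ratio drops by orders of magnitude between one dimension and twenty.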

Figure 8 shows results for experiments done on a real data set. The data
set was a 256 dimensional color histogram data set (one tuple per image) that
was reduced to 64 dimensions by principal components analysis. There were
approximately 13,500 tuples in the data set. We examine k-NN rather than NN
because this is the traditional application of image databases.

To determine the quality of answers for NN queries, we examined the per- 
centage of queries in which at least half the data points were within some factor 
of the nearest neighbor. Examine the graph at median distance/k distance = 3. 
The graph says that for k = 1 (normal NN problem), 15% of the queries had at 
least half the data within a factor of 3 of the distance to the NN. For k = 10, 
50% of the queries had at least half the data within a factor of 3 of the distance 
to the 10th nearest neighbor. It is easy to see that the effect of changing k on 
the quality of the answer is most significant for small values of k. 
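
The quality measure described above can be computed as in the following sketch. It is our reading of the description, not the authors' code: for each query it checks whether at least half of the data points lie within `factor` times the distance to the k-th nearest neighbor, and it reports the fraction of queries for which this holds. The function name and the Euclidean metric are assumptions.

```python
import math

def fraction_low_contrast(data, queries, k, factor):
    """Fraction of queries with at least half the data within factor * (k-NN distance)."""
    hits = 0
    for q in queries:
        dists = sorted(math.dist(q, x) for x in data)
        kth = dists[k - 1]
        # Distance that at least half of the points do not exceed.
        half = dists[(len(dists) - 1) // 2]
        if half <= factor * kth:
            hits += 1
    return hits / len(queries)

data = [[0.0], [1.0], [2.0], [10.0]]
print(fraction_low_contrast(data, [[0.0]], k=1, factor=3))
print(fraction_low_contrast(data, [[0.0]], k=2, factor=3))
```

Larger values mean that more queries return a k-th neighbor that is hardly closer than a typical data point.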

Fig. 8. 64-D color histogram data.



Does this data set provide meaningful answers to the 1-NN problem? the 10- 
NN problem? the 100-NN problem? Perhaps, but keep in mind that the data set 
was derived using a heuristic that approximates image similarity. Furthermore, 
the nearest neighbor contrast is much lower than our intuition suggests (i.e., 2 
or 3 dimensions) . A careful evaluation of the relevance of the results is definitely 
called for. 



6 Analyzing the Performance of a NN Processing 
Technique 

In this section, we discuss the ramifications of our results when evaluating tech- 
niques to solve the NN problem; in particular, many high-dimensional indexing 
techniques have been motivated by the NN problem. An important point that we 
make is that all future performance evaluations of high dimensional NN queries 
must include a comparison to linear scans as a sanity check. 

First, our results indicate that while there exist situations in which high 
dimensional nearest neighbor queries are meaningful, they are very specific in 
nature and are quite different from the “independent dimensions” basis that most 
studies in the literature (e.g., use to evaluate techniques in a 

controlled manner. In the future, these NN technique evaluations should focus 
on those situations in which the results are meaningful. For instance, answers
are meaningful when the data consists of small, well-formed clusters, and the
query is guaranteed to land in or very near one of these clusters. 

In terms of comparisons between NN techniques, most papers do not compare 
against the trivial linear scan algorithm. Given our results, which argue that in 
many cases, as dimensionality increases, all data becomes equidistant to all other 
data, it is not surprising that in as few as 10 dimensions, linear scan handily 
beats these complicated indexing structures.^ We give a detailed and formal
discussion of this phenomenon in m 
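
For reference, the linear scan baseline is only a few lines long. The sketch below is a minimal in-memory version under assumptions of our own: points are sequences of numbers, the metric is Euclidean, and the name `linear_scan_knn` is hypothetical. It says nothing about disk access patterns, which are what make sequential scans competitive in practice.

```python
import heapq
import math

def linear_scan_knn(data, query, k=1):
    """Return the k points of data closest to query, using one pass over the data."""
    return heapq.nsmallest(k, data, key=lambda x: math.dist(x, query))

points = [[0, 0], [3, 4], [1, 1]]
print(linear_scan_knn(points, [0, 0], k=2))
```

Any index structure proposed for NN queries can be timed against this function on the same workload as a sanity check.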

For instance, the performance study of the parallel solution to the k-nearest 
neighbors problem presented in cm indicates that their solution scales more 
poorly than a parallel scan of the data, and never beats a parallel scan in any 
of the presented data. 

m provides us with information on the performance of both the SS tree and 
the R* tree in finding the 20 nearest neighbors. Conservatively assuming that 
linear scans cost 15% of a random examination of the data pages, linear scan 
outperforms both the SS tree and the R* tree at 10 dimensions in all cases. In 
m, linear scan vastly outperforms the SR tree in all cases in this paper for the 
16 dimensional synthetic data set. For a 16 dimensional real data set, the SR 
tree performs similarly to linear scan in a few experiments, but is usually beaten 
by linear scan. In P], performance numbers are presented for NN queries where 
bounds are imposed on the radius used to find the NN. While the performance in 
high dimensionality looks good in some cases, in trying to duplicate their results 
we found that the radius was such that few, if any, queries returned an answer. 

While performance of these structures in high dimensionality looks very poor, 
it is important to keep in mind that all the reported performance studies exam- 
ined situations in which the distance between the query point and the nearest 
neighbor differed little from the distance to other data points. Ideally, they should 
be evaluated for meaningful workloads. These workloads include low dimensional 
spaces and clustered data/queries as described in Section 4. Some of the existing
structures may, in fact, work well in appropriate situations. 



7 Related Work 

7.1 The Curse of Dimensionality 

The term dimensionality curse is often used as a vague indication that high 
dimensionality causes problems in some situations. The term was first used by 
Bellman in 1961 [Zj for combinatorial estimation of multivariate functions. An 
example from statistics: in EH] it is used to note that multivariate density esti- 
mation is very problematic in high dimensions. 

^ Linear scan of a set of sequentially arranged disk pages is much faster than unordered 
retrieval of the same pages; so much so that secondary indexes are ignored by query 
optimizers unless the query is estimated to fetch less than 10% of the data pages. 
Fetching a large number of data pages through a multi-dimensional index usually 
results in unordered retrieval.

In the area of the nearest neighbors problem it is used for indicating that a
query processing technique performs worse as the dimensionality increases. In 
m it was observed that in some high dimensional cases, the estimate of NN 
query cost (using some index structure) can be very poor if “boundary effects” 
are not taken into account. The boundary effect is that the query region (i.e., a 
sphere whose center is the query point) is mainly outside the hyper-cubic data 
space. When one does not take into account the boundary effect, the query cost 
estimate can be much higher than the actual cost. The term dimensionality curse 
was also used to describe this phenomenon. 

In this paper, we discuss the meaning of the nearest neighbor query and not 
how to process such a query. Therefore, the term dimensionality curse (as used 
by the NN research community) is only relevant to Section 6 and not to the
main results in this paper. 



7.2 Computational Geometry 



The nearest neighbor problem has been studied in computational geometry (e.g.. 
However, the usual approach is to take the number of dimensions 
as a constant and find algorithms that behave well when the number of points is 
large enough. They observe that the problem is hard and define the approximate 
nearest neighbor problem as a weaker problem. In E] there is an algorithm that 
retrieves an approximate nearest neighbor in O(logn) time for any data set. In 
E] there is an algorithm that retrieves the true nearest neighbor in constant 
expected time under the IID dimensions assumption. However, the constants for 
those algorithms are exponential in dimensionality. In |S| they recommend not 
to use the algorithm in more than 12 dimensions. It is impractical to use the 
algorithm in P] when the number of points is much lower than exponential in 
the number of dimensions. 



7.3 Fractal Dimensions 

In it was suggested that real data sets usually have fractal properties 

(self-similarity, in particular) and that fractal dimensionality is a good tool in 
determining the performance of queries over the data set. 

The following example illustrates that the fractal dimensionality of the data 
space from which we sample the data points may not always be a good indicator 
for the utility of nearest neighbor queries. Suppose the data points are sampled 
uniformly from the vertices of the unit hypercube. The data space is $2^m$ points
(in m dimensions), so its fractal dimensionality is 0. However, this situation
is one of the worst cases for nearest neighbor queries. (This is actually the IID
Bernoulli(1/2) which is even worse than IID uniform.) When the number of data
points in this scenario is close to $2^m$, nearest neighbor queries become stable,
but this is impractical for large m.
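
The instability of this scenario can be checked directly. The derivation below is a sketch that assumes the query point is also drawn uniformly from the vertices, independently of the data point. Each coordinate difference is then 0 or 1 with probability 1/2, so for any p the p-th power of the Lp distance is a Binomial(m, 1/2) count of differing coordinates:

```latex
% Assumption (ours): data and query points are independent, uniform over the vertices {0,1}^m.
\begin{align*}
(d_m(P_{m,1},Q_m))^p &= \sum_{i=1}^{m} |P_{m,1,i} - Q_{m,i}|^p \sim \mathrm{Binomial}(m, \tfrac{1}{2}),\\
\frac{\operatorname{var}\bigl((d_m)^p\bigr)}{\bigl(E[(d_m)^p]\bigr)^2}
  &= \frac{m/4}{(m/2)^2} = \frac{1}{m} \longrightarrow 0 \quad (m \to \infty).
\end{align*}
```

The relative variance vanishes as m grows, so distances concentrate even though the fractal dimensionality of the data space is 0.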

However, are there real data sets for which the (estimated) fractal dimension- 
ality is low, yet there is no separation between nearest and farthest neighbors? 
This is an intriguing question that we intend to explore in future work.

We used the technique described in |S] on two real data sets (described in
d). However, the fractal dimensionality of those data sets could not be esti- 
mated (when we divided the space once in each dimension, most of the data 
points occupied different cells). We used the same technique on an artificial 100 
dimensional data set that has known fractal dimensionality 2 and about the 
same number of points as the real data sets (generated like the “two degrees of 
freedom” workload in Section 5 but with less data). The estimate we got for
the fractal dimensionality is 1.6 (which is a good estimate). Our conclusion is 
that the real data sets we used are inherently high dimensional; another possible 
explanation is that they do not exhibit fractal behavior. 

8 Conclusions 

In this paper, we studied the effect of dimensionality on NN queries. In particular, 
we identified a broad class of workloads for which the difference in distance 
between the nearest neighbor and other points in the data set becomes negligible. 
This class of distributions includes distributions typically used to evaluate NN 
processing techniques. Many applications use NN as a heuristic (e.g., feature 
vectors that describe images). In such cases, query instability is an indication 
of a meaningless query. This problem is worsened by the use of techniques that 
provide an approximate nearest neighbor to improve performance. 

To find the dimensionality at which NN breaks down, we performed exten- 
sive simulations. The results indicated that the distinction in distance decreases 
fastest in the first 20 dimensions, quickly reaching a point where the difference 
in distance between a query point and the nearest and farthest data points drops 
below a factor of four. In addition to simulated workloads, we also examined two 
real data sets that behaved similarly (see m)- 

In addition to providing intuition and examples of distributions in that class, 
we also discussed situations in which NN queries do not break down in high 
dimensionality. In particular, the ideal data sets and workloads for classifica- 
tion/clustering algorithms seem reasonable in high dimensionality. However, if 
the scenario is deviated from (for instance, if the query point does not lie in a 
cluster), the queries become meaningless. 

The practical ramifications of this paper are for the following two scenarios: 

Evaluating a NN workload. Make sure that the distance distribution (be- 
tween a random query point and a random data point) allows for enough 
contrast for your application. If the distance to the nearest neighbor is not 
much different from the average distance, the nearest neighbor may not be 
useful (or the most “similar”). 

Evaluating a NN processing technique. When evaluating a NN processing 
technique, test it on meaningful workloads. Examples for such workloads are 
given in Section 4. In addition, the evaluation of the technique for a particular
workload should take into account any approximations that the technique 
uses to improve performance. Also, one should ensure that a new processing 
technique outperforms the most trivial solutions (e.g., sequential scan).

Abstract. Partitioning a multi-dimensional data set into rectangular
partitions subject to certain constraints is an important problem that
arises in many database applications, including histogram-based
selectivity estimation, load-balancing, and construction of index structures.
While provably optimal and efficient algorithms exist for partitioning 
one-dimensional data, the multi-dimensional problem has received less 
attention, except for a few special cases. As a result, the heuristic parti- 
tioning techniques that are used in practice are not well understood, and 
come with no guarantees on the quality of the solution. In this paper, we 
present algorithmic and complexity-theoretic results for the fundamental 
problem of partitioning a two-dimensional array into rectangular tiles of 
arbitrary size in a way that minimizes the number of tiles required to sat- 
isfy a given constraint. Our main results are approximation algorithms 
for several partitioning problems that provably approximate the optimal 
solutions within small constant factors, and that run in linear or close 
to linear time. We also establish the NP-hardness of several partitioning
problems; it is therefore unlikely that there are efficient, i.e., polynomial
time, algorithms for solving these problems exactly.

We also discuss a few applications in which partitioning problems arise. 
One of the applications is the problem of constructing multi-dimensional 
histograms. Our results, for example, give an efficient algorithm to con- 
struct the V-Optimal histograms which are known to be the most ac- 
curate histograms in several selectivity estimation problems. Our algo- 
rithms are the first to provide guaranteed bounds on the quality of the 
solution. 



1 Introduction 

Many problems arising in databases and other areas require partitioning a
multi-dimensional data set into rectangular partitions or tiles such that certain
mathematical constraints are satisfied. Often these constraints take the form of
minimizing (or maximizing) a metric using a fixed number of partitions or,
conversely, minimizing the number of partitions while not exceeding (or falling
below) a given value of that metric.

* This work was done while the author was at Bell Labs.

These problems are quite challenging for most interesting metrics and hence 
one usually resorts to heuristic approaches. Unfortunately, many of these ap- 
proaches do not provide any guarantees on the quality of the solution and may 
thus adversely affect the application. In this paper we present algorithmic and 
complexity-theoretic results on the fundamental problem of partitioning
two-dimensional arrays into rectangular partitions (tiles)^ of arbitrary sizes.
We develop solutions that offer guarantees on the quality of their solutions and that
run in small polynomial time (near-linear in most cases). We start out with a 
few examples. 

Example 1. Consider the 4 × 4 array in Fig. 1(a). A partitioning with 5 tiles is
shown in Fig. 1(b) such that the maximum sum of the elements that fall within
any one tile is at most 57. There are many different ways to tile the array with 5
tiles with different maximum sums. There are also alternative ways to evaluate
the partitioning, other than by considering the maximum sum of the elements.
For instance, for each tile, we could sum up the squares of the difference between
each element in the tile and the average of all the elements in that tile; we could
then total all the values thus obtained for the tiles. This value is 204.7 as shown
in Fig. 1(b). Again, different partitions induce different values. □
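
Both ways of evaluating a partitioning are straightforward to compute. The sketch below uses a hypothetical 4 × 4 array and a hypothetical 5-tile partitioning, since the contents of Fig. 1 are not reproduced here; tiles are written as inclusive bounds (first row, last row, first column, last column), and all function names are our own.

```python
def tile_cells(array, tile):
    """Elements of array inside a tile given as inclusive (r0, r1, c0, c1) bounds."""
    r0, r1, c0, c1 = tile
    return [array[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def max_sum_metric(array, tiles):
    """Largest sum of the elements falling within any one tile."""
    return max(sum(tile_cells(array, t)) for t in tiles)

def sse_metric(array, tiles):
    """Total over tiles of squared differences between each element and its tile average."""
    total = 0.0
    for t in tiles:
        cells = tile_cells(array, t)
        mean = sum(cells) / len(cells)
        total += sum((v - mean) ** 2 for v in cells)
    return total

# Hypothetical data, not the array of Fig. 1.
array = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
tiles = [(0, 1, 0, 1), (0, 1, 2, 3), (2, 2, 0, 3), (3, 3, 0, 1), (3, 3, 2, 3)]
print(max_sum_metric(array, tiles), sse_metric(array, tiles))
```

For this hypothetical input the maximum tile sum is 42 and the sum of squared differences is 40.0; a different tiling of the same array generally changes both values.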

Example 2. Consider the 4 × 4 array in Fig. 1(a). A 3 × 3 tiling, namely one
obtained by partitioning rows into 3 intervals and columns into 3 intervals, is
presented in Fig. 1(c). For the partition presented there, the maximum sum of
the elements that fall within any tile is at most 79. Again, different partitions
induce different values. □
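
A grid tiling of this kind is determined entirely by the row intervals and the column intervals. The following sketch evaluates the maximum tile sum of such a tiling on a hypothetical 4 × 4 array (again not the array of Fig. 1); intervals are inclusive index pairs and the function name is our own.

```python
def grid_max_sum(array, row_intervals, col_intervals):
    """Largest tile sum when rows and columns are each cut into inclusive intervals."""
    best = None
    for r0, r1 in row_intervals:
        for c0, c1 in col_intervals:
            s = sum(array[r][c]
                    for r in range(r0, r1 + 1)
                    for c in range(c0, c1 + 1))
            best = s if best is None else max(best, s)
    return best

# Hypothetical data: a 3 x 3 tiling of a 4 x 4 array.
array = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
rows = [(0, 0), (1, 2), (3, 3)]
cols = [(0, 1), (2, 2), (3, 3)]
print(grid_max_sum(array, rows, cols))
```

Here the heaviest of the nine tiles covers rows 1 to 2 and columns 0 to 1 and sums to 30.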

Fig. 1. Partitioning Examples



Example 1 arises in data partitioning for load balancing m and histogram- 
based selectivity estimation while Example 2 arises in constructing grid- 
files which are well-known index structures m- We will list other application 
scenarios further below, but we briefly describe the histogram context here. 

^ We use the terms tile and partition interchangeably in the rest of the paper.

Example Application: (Histograms) Query optimizers require reasonably accurate
estimates of query result sizes in order to estimate the costs of various
execution plans. Most commercial database management systems use histograms 
to approximate the data in the database in order to perform these estimations. 
Histograms group attribute values into subsets (buckets) and approximate true 
attribute values and their frequencies based on summary statistics maintained 
in each bucket 1201311 . Researchers have proposed the use of multi-dimensional
histograms for approximating distributions with multiple attributes ITTffT^ . This 
approach involves partitioning the multi-dimensional space of attribute values 
into rectangular buckets based on various partitioning constraints. This leads to 
partitioning problems of the sort we consider in this paper. In particular, the 
well-known V-optimal histogram in two dimensions, which has been shown to
minimize selectivity estimation errors for estimating the result sizes of several 
classes of queries HM, corresponds to the alternative metric, and one of the 
partitions in Example 1. □ 

There are many different types of partitions and many different metrics to 
evaluate the partitions. A generic optimization problem that arises in several 
application contexts is as follows. 

The Partitioning Problem. We are given a two-dimensional array of elements,
the type of partition sought, a metric to evaluate the partitions, and a bound $\delta$.
The problem is to produce a partitioning of the array of the type sought, with
the minimal number of rectangular tiles such that the metric computed on the
partition is at most $\delta$. □

The partitioning problem has been extensively studied in many application 
scenarios in the Database and Algorithms Research communities, in various 
specialized forms. However, most known results are either heuristics without 
provable guarantees, or provably efficient algorithms for simple, specific metrics 
and tilings. For example, little is known about the complexity of the partitioning 
problem with the alternative metric in Example 1 for different types of partitions, 
but this problem is fundamental in histogram construction. 

Motivated by this state-of-the-art, we formulate the partitioning problem in 
its generality and study its difficulty for various types of partitions and metrics 
of relevance in application scenarios. Our contributions are two-fold. 

1. We show that the partitioning problem is NP-hard for many natural metrics 
and partitionings arising in database applications. Thus, efficient, i.e. poly- 
nomial time, algorithms exist if and only if NP = P, a central complexity 
question that remains unresolved. 

The partitioning problem can be naturally defined on one-dimensional arrays 
too. These problems can be solved efficiently in small polynomial time ESI, 
but our claim above implies that their natural two-dimensional variants are 
NP-Hard. Thus, the partitioning problem becomes fundamentally different 
as we go from one-dimensional to multidimensional arrays. 

2. Our main technical results are a number of algorithms that complement the
negative results above. We present very efficient (near-linear time)
algorithms for approximately solving the partitioning problem in two dimensions
for fairly general metrics and different partitionings (including most that
arise in our database applications); the approximation bounds are guaranteed,
and they are small constants.
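
To illustrate why the one-dimensional case is easy, consider one concrete instance: nonnegative entries and the maximum-sum metric. A left-to-right greedy sweep that extends each interval as far as the bound allows is known to use the fewest intervals. The sketch below shows this standard argument in code; it is not claimed to be the algorithm of the work cited above, and the function name is hypothetical.

```python
def min_intervals(values, delta):
    """Fewest contiguous intervals of nonnegative values with every interval sum <= delta."""
    if any(v < 0 for v in values):
        raise ValueError("the greedy argument assumes nonnegative entries")
    if any(v > delta for v in values):
        raise ValueError("no feasible partition: a single entry exceeds delta")
    intervals, start, running = [], 0, 0
    for i, v in enumerate(values):
        if running + v > delta:
            # Close the current interval just before position i and start a new one.
            intervals.append((start, i - 1))
            start, running = i, 0
        running += v
    if values:
        intervals.append((start, len(values) - 1))
    return intervals

print(min_intervals([3, 1, 4, 1, 5, 9, 2, 6], 10))
```

For the sample input the sweep produces four intervals, given as inclusive index pairs. No comparably simple sweep is available in two dimensions, which is consistent with the hardness results stated above.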

We define the partitioning problem and state our results more formally in 
Section 2. Here are a few high-level remarks on our results.

Remark 1. All of our algorithms extend to multiple dimensions with similar 
(i.e., near-linear) performance bounds. However, this may sound better than it 
actually is, because they will take time that is linear in the size of the underlying 
d-dimensional array. In many applications, this array may be very
sparse, and an algorithm that works in time linear in the number of non-zero 
entries would be much preferable, while one that works in time linear in the size 
of the array can be prohibitively expensive. Some of our algorithms can likely 
be modified to exploit sparseness of the input domain, while for others this is 
more difficult. 

Remark 2. There are other limitations to our framework of partitioning prob- 
lems. For example, in some application scenarios, tiles may be allowed to overlap, 
i.e., the problem may be one of covering the array rather than partitioning it. 
This arises, for example, when building certain spatial indices such as R-trees.
Our results do not directly apply to such covering problems. Also, there are ap- 
plication scenarios where one may be allowed to permute the rows and columns 
of the input array. All our results assume that some canonical ordering (such as 
that given by the natural order of numeric attributes in database applications) 
is fixed. 

Remark 3. Despite the limitations above, the partitioning problems we study 
are very general, and they arise in a number of important application scenarios 
within databases. Besides the problems of histogram-based selectivity estima- 
tion, grid file construction, and load balancing mentioned earlier, there are ap- 
plications in database compression, bulk-loading of hierarchical index structures, 
and partitioning of spatial data, as well as to problems outside databases such 
as domain partitioning in scientific computation, support of data partitioning 
in data-parallel languages, and image and video processing. In this paper, we 
clarify applications to histogram-based selectivity estimation only; discussions 
about other applications can be found in the full version of this paper. 

□ 

Map. The rest of the paper is organized as follows. We formalize the
partitioning problem and relate it to various application contexts in
Section 2; we also state our results formally there. In Sections 4, 5, and 6,
we present hardness and algorithmic results for different types of partitions
with different metrics. In Section 7, we discuss the implications of our
results for one applied area in databases, namely, histogram-based selectivity
estimation. Section 8 has concluding remarks.

2 Problem Formulation and Overview of Results

2.1 The Partitioning Problem Definition 

We are given an n x n array A containing N = n^2 real numbers. A tile is any
rectangular subarray A[i..j, k..l]. A partitioning of array A is a partitioning
of A into tiles; by definition of a partition, each element lies within some
tile and no two tiles overlap. As mentioned in the previous section,
partitioning schemes can be classified based on the type of partition and the
metric functions used within and outside the tiles to evaluate the partition.
We define these criteria below and present interesting instantiations for each
of them.

Type of Partitioning: There are many possible types of partitionings of a 
two-dimensional array. The ones we consider here are the common ones that 
arise in database applications: 

1. Arbitrary: No restrictions on the arrangement of tiles within A (Figure 2a).

2. Hierarchical: A hierarchical partition is one in which there exists a vertical
or horizontal separation of the array into two disjoint parts, in each of which
the partitioning is again hierarchical (Figure 2b). A hierarchical partitioning
is naturally represented by a hierarchy tree, which is a binary tree in which a
node represents a subarray of A and its children represent a partition
of that subarray of A into two disjoint parts; the root represents A.

3. p x p: Here, the rows and columns of A are partitioned into p disjoint
intervals each; the induced p^2 tiles form the p x p partitioning. This can be
thought of as a special case of hierarchical partitioning where the tilings in
the subarrays of two sibling nodes of the hierarchy tree are the same along one
dimension (Figure 2c).



Fig. 2. Tilings: (a) arbitrary, (b) hierarchical, (c) p x p.



Quality Metrics: Metrics are defined on the tiling using the following three 
functions. 

1. Elementary Function: An elementary function h maps the array elements of
any tile to real numbers. Common functions include (i) ID, i.e., h(A[i]) =
A[i], (ii) AVG_DIFF, i.e., h(A[i]) = |A[i] - Ā| where Ā is the average of the
elements in that tile, (iii) GEO_DIFF, i.e., h(A[i]) = |A[i] - A'| where A' is
the geometric median of the elements in the tile, and (iv) SQR_DIFF, which
is the square of AVG_DIFF.

2. Heft Function. A heft function g is defined for a tile r and an elementary
function h of the elements in that tile. The common heft functions are (i) SUM,
i.e., g(r) = Σ_{A[i,j] ∈ r} h(A[i,j]), (ii) MAX, i.e., g(r) = max_{A[i,j] ∈ r}
h(A[i,j]), (iii) RATIO, etc. SUM and MAX are the common heft functions
in our application scenarios. Notice that the heft is an "intra-tile" function.
The value evaluated by the heft function in a tile is called its heft.

3. Cumulative Function. A cumulative function f is defined for the entire set
of tiles, for combining the values of the heft function g. The common
cumulative functions are SUM and MAX, representing the total and maximum heft
value of tiles in the given tiling, respectively. Note that the cumulative
function is an "inter-tile" function.

Any partition thus has a metric that is a combination of these three functions,
written f-g-h. For example, the SUM-MAX-ID metric of a partition is the sum over
the tiles of the maximum element in each tile. In Section 1, Example 1 has
the MAX-SUM-ID metric with the alternative being the SUM-SUM-SQR_DIFF
metric, and Example 2 has the MAX-SUM-ID metric. Some combinations of the
functions are trivial (for example, all partitions are identical under the
MAX-MAX-ID and SUM-SUM-ID metrics), but most of the metrics are nontrivial.
The value the metric evaluates to on a given array and partition is called its
metric value or, where there is no ambiguity, the heft of the partition.
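
The three-level f-g-h composition can be illustrated with a short sketch. The function names, the 0-based inclusive tile encoding (i, j, k, l), and the restriction to the ID, AVG_DIFF, and SQR_DIFF elementary functions are assumptions made for this illustration only; they are not part of the paper.

```python
from statistics import mean


def elementary_values(values, h):
    """Apply an elementary function h to the values of one tile."""
    if h == "ID":
        return list(values)
    avg = mean(values)
    if h == "AVG_DIFF":
        return [abs(v - avg) for v in values]
    if h == "SQR_DIFF":
        return [(v - avg) ** 2 for v in values]
    raise ValueError(f"unsupported elementary function: {h}")


def metric_value(A, tiles, f, g, h):
    """Evaluate the f-g-h metric of a tiling.

    tiles is a list of (i, j, k, l): rows i..j and columns k..l,
    0-based and inclusive (a hypothetical encoding for this sketch).
    f and g are each "SUM" or "MAX".
    """
    combine = {"SUM": sum, "MAX": max}
    hefts = []
    for (i, j, k, l) in tiles:
        values = [A[r][c] for r in range(i, j + 1) for c in range(k, l + 1)]
        hefts.append(combine[g](elementary_values(values, h)))  # intra-tile
    return combine[f](hefts)  # inter-tile


A = [[1, 2], [3, 4]]
rows = [(0, 0, 0, 1), (1, 1, 0, 1)]   # two tiles, one per row
whole = [(0, 1, 0, 1)]                # a single tile
print(metric_value(A, rows, "SUM", "MAX", "ID"))
print(metric_value(A, rows, "MAX", "SUM", "ID"))
```

On this small array, SUM-SUM-ID gives the same value for both tilings, which matches the remark that this combination is trivial.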

The Optimization Problem: The optimization problem we consider is as
follows: For the given type of partition, metric, and bound δ, determine the
partitioning of that type that has the minimum number of tiles with the metric
value of the partition being at most δ. A related problem is one in which we are
given a bound p on the number of tiles and the goal is to determine a tiling
using at most p tiles with minimum metric value. As far as NP-hardness is
concerned, these two optimization problems are identical, that is, if one is
NP-hard, so is the other. However, this similarity does not extend to efficient
approximability of these problems. In this paper, we only consider the first
version.

Several partitioning problems arising in real applications mentioned in the 
Introduction fit naturally in our framework. We present one such application 
(histograms) in full detail in Section 7 and illustrate how our problem definition
and results solve an important problem arising in that application. 



2.2 Some Preliminaries 

We state some properties of the metrics and hefts that will be used later. We
say that a heft function g is monotonic if g(r) ≤ g(R) for any two tiles r and R
with r ⊆ R. For example, SUM-ID is monotonic provided the array elements are
non-negative. Most of the heft functions that arise in our applications are
monotonic. We say that a metric f-g is superadditive if
f(g(r ∪ R)) ≥ f(g(r), g(R)), where r and R are any two disjoint tiles. For
example, SUM-SUM-SQR_DIFF is superadditive (this requires a nontrivial proof).
Some of our results hold only for superadditive metrics.
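
As a numeric illustration of superadditivity (not a proof), the following sketch checks the inequality for SUM-SUM-SQR_DIFF on the values of two disjoint tiles and of their union. The helper names are hypothetical, and tiles are represented simply as flat lists of their values.

```python
def sqr_diff_heft(values):
    """SUM-SQR_DIFF heft: sum of squared deviations from the tile average."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values)


def superadditive_on(left, right, tol=1e-9):
    """Check heft(left ∪ right) >= heft(left) + heft(right) for one instance."""
    merged = list(left) + list(right)
    return sqr_diff_heft(merged) + tol >= sqr_diff_heft(left) + sqr_diff_heft(right)


left = [1.0, 2.0, 3.0]    # values of tile r
right = [10.0, 12.0]      # values of a disjoint tile R
print(sqr_diff_heft(left), sqr_diff_heft(right), sqr_diff_heft(left + right))
print(superadditive_on(left, right))
```

Splitting a tile can therefore only lower (or preserve) this metric, which is the property the later conversion lemmas rely on.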

Our algorithms often identify tiles and are required to compute their hefts.
Let t_Q denote the time to determine the heft of any tile. The straightforward
way is to consider each element of the tile, which takes t_Q = O(N)
time in the worst case. However, there is a more efficient method in some cases.
Consider the MAX heft function: after O(N) time preprocessing, we can compute
the heft of any tile in t_Q = O(1) time. We omit the details of this procedure
here. For SUM heft functions, the same holds for some elementary functions such
as SQR_DIFF and ID, and not for others, such as AVG_DIFF, GEO_DIFF,
etc. There are still other examples of heft functions for which t_Q is O(log N).
In what follows, we state all our bounds in terms of t_Q. For applications
arising in databases, t_Q is often O(1) since SUM-SQR_DIFF is the most common
heft function (such as in V-optimal histograms).
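
One standard way to obtain t_Q = O(1) for the SUM heft with the ID and SQR_DIFF elementary functions is two-dimensional prefix sums over the values and over their squares. The sketch below assumes that technique; the paper does not spell out its preprocessing, and the function names and 0-based inclusive indexing are choices made here.

```python
def build_prefix_sums(A):
    """O(N) preprocessing: prefix sums of the values and of their squares."""
    n, m = len(A), len(A[0])
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    S2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(m):
            v = A[r][c]
            S[r + 1][c + 1] = v + S[r][c + 1] + S[r + 1][c] - S[r][c]
            S2[r + 1][c + 1] = v * v + S2[r][c + 1] + S2[r + 1][c] - S2[r][c]
    return S, S2


def rect_sum(P, i, j, k, l):
    """Sum over rows i..j and columns k..l (0-based, inclusive) in O(1)."""
    return P[j + 1][l + 1] - P[i][l + 1] - P[j + 1][k] + P[i][k]


def sum_id_heft(S, i, j, k, l):
    """SUM-ID heft of a tile."""
    return rect_sum(S, i, j, k, l)


def sum_sqr_diff_heft(S, S2, i, j, k, l):
    """SUM-SQR_DIFF heft, using sum((v - avg)^2) = sum(v^2) - (sum v)^2 / count."""
    count = (j - i + 1) * (l - k + 1)
    total = rect_sum(S, i, j, k, l)
    return rect_sum(S2, i, j, k, l) - total * total / count


A = [[1, 2], [3, 4]]
S, S2 = build_prefix_sums(A)
print(sum_id_heft(S, 1, 1, 0, 1))
print(sum_sqr_diff_heft(S, S2, 0, 1, 0, 1))
```

The identity in `sum_sqr_diff_heft` is what makes SQR_DIFF decomposable; AVG_DIFF has no comparable closed form, which is consistent with the distinction drawn above.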



3 Related Work 



Partitioning problems have been studied extensively in various application areas 
including databases, parallel computing (e.g., load balancing), computational 
geometry (e.g., clustering), video compression (e.g., block matching) etc. Some 
related papers from a variety of application areas include 
Here we review a selection of related work most relevant to us. 






Hardness Results. Hardness results exist only for a simple metric function, 
namely, MAX-SUM-ID - [IHl proved it to be NP-hard for arbitrary partitions, 
and Hug proved it to be NP-hard for px p partition. Our NP-hardness results 
are inspired by the abovementioned results. However, the basic gadgets in our 
reductions are different from the ones in and we derive different non- 

approximability bounds for various metrics. 



Algorithmic Results. Dynamic programming has been used to find optimal re- 
sults for hierarchical partitions in several contexts pm. However, our sparse 
hierarchy approach with provable guarantee appears to be new. For p x p par- 
tition MAX-SUM-ID, the best known approximation is by O(logn) factor [TTlj . 
We derived substantially improved bounds for this problem. Many heuristic al- 
gorithms have been proposed for this problem m- 



Applications: We focus on prior work in one database application, namely,
histograms. Histograms have been studied quite extensively in the literature in 
the context of approximating single attributes y21 1 ,41 1 4141 )l,4 1 1 1 . On the 

other hand, there has been very little work on multi-dimensional histograms.
Muralikrishna et al proposed a heuristic hierarchical algorithm for constructing 
multi-dimensional equidepth histograms PZ]. Poosala et al extended the defini- 
tion of multi-dimensional histograms to other bucketization techniques, such as 
V-Optimal, MaxDiff, etc. [23] and provided a more sophisticated and general
hierarchical partitioning algorithm called MHIST. But neither of these
algorithms provides any guarantee on the quality of the partitioning.



4 Hierarchical Tilings 

In this section, we consider the hierarchical partitioning problems. Recall that
the problem is to produce a hierarchical partition of an n x n array A with
metric value at most δ using the minimum number of tiles. We set N = n^2
throughout. Let t_Q be an upper bound on the time taken to calculate the heft
value of any tile in A; all our bounds below will be in terms of t_Q. First we
focus on exact algorithms, and then approximate ones. We state our results for
only SUM and MAX metrics, but they hold for any superadditive metric with time
bound identical to that for the SUM metric.



4.1 Exact Algorithms 

Theorem 1. For cumulative function MAX, there exists an O(N^{2.5} + N^2 t_Q)
time algorithm to solve the hierarchical partitioning problem exactly. For
cumulative function SUM, there exists an O(N^{2.5} (B*)^2 t_Q) time algorithm to
solve the hierarchical partitioning problem exactly; here, B* is the number of
tiles in the optimum solution.



Proof. The proof uses dynamic programming; it is simple, and it has appeared,
for example, in m, for a special case of the heft function. We include it here
since some of our algorithms will be derived by suitably modifying this
solution. We consider the cumulative function MAX only; the SUM case is a simple
modification. Define E*(i..j, k..ℓ) to be the smallest number of tiles needed to
partition the region A[i..j, k..ℓ] with metric value at most δ. If the heft of
the tile A[i..j, k..ℓ] is at most δ, we have E*(i..j, k..ℓ) = 1. Else, we have
Equation (1) below:

$$
E^*(i..j,\, k..\ell) \;=\; \min_{i \le x < j,\;\; k \le y < \ell}
\left\{
\begin{array}{l}
E^*(i..x,\, k..\ell) + E^*(x+1..j,\, k..\ell), \\
E^*(i..j,\, k..y) + E^*(i..j,\, y+1..\ell)
\end{array}
\right\}
\qquad (1)
$$

We need to calculate E* = E*(1..n, 1..n). We use the dynamic programming
technique and calculate E*(i..j, k..ℓ) for all 1 ≤ i ≤ j ≤ n and 1 ≤ k ≤ ℓ ≤ n.
In all, there are O(n^4) possible sub-rectangles, and for each we calculate O(n)
values and the heft of a single tile. Thus we take O(n^5 + n^4 t_Q) time, which is
O(N^{2.5} + N^2 t_Q) time. □
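
The recurrence in the proof can be sketched as a memoized recursion. This is a minimal illustration under stated assumptions: the heft is SUM-ID (computed from prefix sums), the cumulative function is MAX, indices are 0-based and inclusive, and the function name is hypothetical. It returns infinity when no partition meets the bound.

```python
from functools import lru_cache


def min_hierarchical_tiles(A, delta):
    """Minimum number of tiles in a hierarchical partition of A such that
    every tile has SUM-ID heft at most delta (MAX cumulative function)."""
    n, m = len(A), len(A[0])
    # Prefix sums give the heft of any tile in O(1).
    P = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(m):
            P[r + 1][c + 1] = A[r][c] + P[r][c + 1] + P[r + 1][c] - P[r][c]

    def heft(i, j, k, l):
        return P[j + 1][l + 1] - P[i][l + 1] - P[j + 1][k] + P[i][k]

    @lru_cache(maxsize=None)
    def E(i, j, k, l):
        if heft(i, j, k, l) <= delta:
            return 1  # the region itself is a valid tile
        best = float("inf")
        for x in range(i, j):   # horizontal cut between rows x and x+1
            best = min(best, E(i, x, k, l) + E(x + 1, j, k, l))
        for y in range(k, l):   # vertical cut between columns y and y+1
            best = min(best, E(i, j, k, y) + E(i, j, y + 1, l))
        return best

    return E(0, n - 1, 0, m - 1)


A = [[1, 2], [3, 4]]
for delta in (10, 7, 4):
    print(delta, min_hierarchical_tiles(A, delta))
```

The memo table has one entry per sub-rectangle and each entry tries O(n) cuts, which mirrors the counting argument in the proof; the sketch is meant for small arrays only.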



4.2 Approximate Algorithms 

In this section, we present approximation algorithms for computing hierarchical
partitions for the SUM and MAX metrics (as before, they can be modified to
work for any superadditive metric); the algorithms in this section are faster than
the exact ones in Section 4.1. Our algorithms rely on two natural, well-known
ideas, namely, rounding and pruning, which we informally describe below. These
strategies "sparsify" the dynamic programming, that is, reduce the number of
subproblems to be considered; this results in a faster dynamic programming
solution. They are easy to implement, and the technical crux is their analysis.

Rounding is a structured way to limit ourselves to only a subset of all possi- 
ble tiles. This is achieved by considering only tiles with endpoints at multiples 
of a small number of parameters. However, there are many ways to perform 
rounding: round only along one of the dimensions, or possibly both; round so 
both the endpoints of tiles are multiples of the parameters, or just one of them; 
etc. Rounding can also be done hierarchically with choice of parameter values 
and rounding techniques at different levels. A general tradeoff for solving the 
partitioning problems in terms of all these combinations is difficult to state, but 
in our solutions, we employ interesting combinations and obtain our bounds. 

Rounding to one grid pattern. We fix a parameter L and assume n is a
multiple of L; the following description can be easily modified to handle other
values of n. We define an L-grid as the subset of columns and rows numbered
1, L+1, 2L+1, ...; thus there are n/L grid columns and rows in an L-grid.
We refer to each such row (column) as an L-row (L-column, respectively). We
define the 1-hierarchical partition to be a hierarchical partition in which the
tiles additionally satisfy the following two conditions. (1) At least one side
has both its endpoints on L-rows or L-columns, that is, the side is of length a
multiple of L, and (2) Both its sides have their endpoints between two
consecutive L-rows or two consecutive L-columns; that is, they are each of
length at most L. The 1-hierarchical partitioning problem is to produce a
1-hierarchical partition of an n x n array A with metric value at most δ using
the minimum number of tiles. In what follows we will argue the following two
points: (1) the optimal 1-hierarchical partition can be determined efficiently,
and (2) the optimal 1-hierarchical partition approximates the hierarchical
partition nicely.

Lemma 1. For cumulative function MAX, there exists an O^N^'^Iq) time al- 
gorithm to solve the 1-hierarchical partitioning problem exactly. For cumulative 
function SUM, there exists an (B*)^tQ) time algorithm to solve the 1- 

hierarchical partitioning problem exactly; here, B* is the number of tiles in the 
optimum solution. 

Proof. We will only show the proof for the MAX function; the proof for the 
SUM function is similar. Define E* {i ■ ■ ■ j — 1, k ■ ■ ■ i — 1) to be smallest number 
of tiles in an 1-hierarchical partition of the region A[i ■ ■ ■ j — l,k ■ ■ X — 1\ with 
metric value at most 5. We employ the Equation (I) to compute E*{i ■ ■ ■ j,k ■ ■ ■ t), 
with some changes in the set of values taken by x and y in Equation (I). If 
j — i > L, then x takes only the values of L-rows in Equation (I) between i and 
j. otherwise, x takes all values between i and j; same holds for y values too. As 
before, the computation is done using the dynamic programming technique to 
calculate E* = E*{1 ■ ■ ■ n,l ■ ■ ■ n). By choosing L = (optimized based on 
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the running time calculations), the total running time of this algorithm turns 
out to be 0{N^'^tQ). □ 
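
The restriction on cut positions described in the proof can be sketched as a small helper. The index convention is an assumption made for this illustration: a side is treated as the half-open range [i, j) with 1-based positions, so that L-rows sit at 1, L+1, 2L+1, ... and a cut at x splits the range into [i, x) and [x, j). The function name is hypothetical.

```python
def candidate_cuts(i, j, L):
    """Cut positions allowed for a side [i, j) in the rounded recurrence.

    Long sides (length greater than L) may only be cut at L-grid positions;
    short sides may be cut anywhere strictly inside the range.
    """
    if j - i > L:
        return [x for x in range(i + 1, j) if (x - 1) % L == 0]
    return list(range(i + 1, j))


# A side spanning three grid cells of width 4: only grid positions are tried.
print(candidate_cuts(1, 13, 4))
# A short side inside one grid cell: every interior position is tried.
print(candidate_cuts(2, 5, 4))
```

Replacing the full range of x and y in Equation (1) by such candidate lists is what reduces the number of subproblems in the rounded dynamic program.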

Lemma 2. Consider any hierarchical partition of array A with B tiles and
metric value at most δ, for a superadditive metric. There exists a 1-hierarchical
partition using at most 9B tiles with metric value at most δ for any positive
integer L.

The proof of this lemma is omitted for space constraints. The two preceding
lemmas together let us conclude the following.

Theorem 2. For cumulative function MAX, there exists an 0{N^'^tQ) time 
algorithm for solving the hierarchical partitioning problem that returns a hier- 
archical partition with at most 9B* tiles and metric value at most S; here, B* 
is the minimum number of tiles in a hierarchical partition with metric value at 
most 6 and the metric is superadditive. For the cumulative function SUM, the 
same holds with running time 0{N^-^{B*)'^tQ); here, B* is the number of tiles 
in the optimum solution. 

This algorithm is faster than the one in Theorem 1 but produces a
partitioning with a slightly larger number of tiles. We can improve the running
time further by increasing the number of tiles in a structured manner as
described below.

Rounding to several grid patterns. We are given a sequence of positive
integers L_0, L_1, L_2, ..., L_k such that L_0 = n, L_k = 1, and L_{i+1} divides
L_i for i ≥ 0. We define a k-hierarchical partition as follows. It is a
hierarchical partition in which there are k sets S_i of permissible intervals
that can be the side lengths of the tiles. S_1 comprises the intervals
[jL_1, ℓL_1] for some integers j < ℓ. In general, S_i, for 1 < i ≤ k, comprises
the set of intervals [jL_i, ℓL_i] such that for some h, hL_{i-1} ≤ jL_i and
(h+1)L_{i-1} ≥ ℓL_i. The set S_k comprises the intervals in S_i, for i = k, as
defined above, but additionally, all subintervals thereof. The k-hierarchical
partitioning problem is to find the k-hierarchical partition with metric value
at most δ and minimum number of tiles. We can solve this problem again using
dynamic programming, although the solution is somewhat more involved. The
analysis of the running time of this algorithm follows the same lines as in
Lemma 1. We choose the L_i's appropriately to minimize the running time; they
turn out to form a geometric progression. Also, we claim: Consider any
hierarchical partition of array A with B tiles and cumulative metric value at
most δ, for any superadditive metric. There exists a k-hierarchical partition
using at most (2k+1)^2 B tiles with cumulative metric value at most δ for any
positive integer k. The proof is similar to that of Lemma 2. This lets us
conclude the following.

Theorem 3. For cumulative function MAX, there exists an O(N^{1+ε} t_Q) time
algorithm for solving the hierarchical partitioning problem with at most
B* · O(1/ε^2) tiles with metric value at most δ; here, B* is the minimum number
of tiles in a hierarchical partition with metric value at most δ, ε is any
chosen positive fraction, and the cumulative metric is superadditive.






By choosing ε appropriately, we can get the running time to be as close to
O(N t_Q) as we desire (which is linear for our applications; see Section 2.2).
This is achieved at the expense of more (though not a very large number of) tiles.
A different result which may be of interest is the following: we can prove that 
there is an 0{N time algorithm that approximates the minimum number 
of buckets to 0((loglogn)^) factor, for any superadditive metric. 

The other natural idea we explore for approximate algorithms is pruning. 
Pruning also limits the set of tiles that are examined in dynamic program- 
ming. However, this is data-dependent since, effectively, we do not examine those 
tiles for which the heft is beyond certain prune condition. There is no efficient 
pruning strategy without rounding, since there are many large tiles that can- 
not be pruned. We have different pruning conditions for MAX and SUM. Due 
to space constraints, we omit the description of these conditions which can be 
found in m- Two examples of the results we obtain for MAX metrics are: an 
Q(jyi.25(^* js) algorithm for factor 9 approximation, and 0{N{B*)^) time 

algorithm for factor 25 approximation. Further results for MAX, SUM and other 
superadditive metrics can be found in 1291 . 



5 Arbitrary Partitionings 

5.1 NP-Hardness Results 

In this subsection, we prove NP-hardness results for several metrics, showing
that finding the partitioning with p tiles that minimizes the heft is NP-hard.
In fact, the proof also implies limits on the approximability of the problem for
some cases.

For the special case of the MAX-SUM-ID metric, it was shown in m that
the minimum heft cannot be approximated to within a factor of 1.25. We
establish similar results for a different set of metrics that includes
SUM-SUM-SQR_DIFF, MAX-MAX-AVG_DIFF, and MAX-MAX-GEO_DIFF. As in HH)
and the earlier work in P], the proof is based on a reduction from the Planar
3SAT problem (shown to be NP-complete in |U), though a number of changes
are needed to adapt the argument to our types of metrics. Similar results can
also be shown for several other metrics, but we restrict ourselves to the most
important ones (proofs are omitted here).

Theorem 4. Given a data distribution A and an upper bound on p, it is NP- 
hard 

- to find the minimum heft of any rectangular partitioning with p tiles under
the SUM-SUM-SQR_DIFF metric, and

- to approximate the minimum heft of any rectangular partitioning with p tiles
under the MAX-MAX-GEO_DIFF metric to any factor less than 2, and

- to approximate the minimum heft of any rectangular partitioning with p tiles
under the MAX-MAX-AVG_DIFF metric to any factor less than 3/2.

5.2 Approximate Algorithms

In view of the hardness results in Section 5.1, we cannot anticipate efficient,
that is, polynomial time, algorithms for exactly solving the partitioning
problems with arbitrary partitions for many natural metrics. In this section, we
focus on developing efficient approximate algorithms instead. All our
approximations are based on the following observation, which was presented in
UHl for a special metric.

Lemma 3. Consider any arbitrary partitioning of a two-dimensional array A
with superadditive cumulative metric at most δ and B tiles. There exists a
hierarchical partition of A with cumulative metric at most δ and at most 4B
tiles.

Proof. It is shown in P] that any arbitrary rectangular partition can be
converted into a hierarchical one by splitting each tile into at most 4 disjoint
tiles. If we apply their procedure to the given arbitrary partition, the
resulting hierarchical partition has at most 4B tiles; furthermore, the metric
value of this hierarchical partition is at most δ by the superadditivity of the
metric. □

Using this observation with the results in Section 4, we get the following.
(Similar results can be obtained for the SUM cumulative function as well.)

Theorem 5. For cumulative function MAX, let B* be the number of tiles in an
optimal solution to the arbitrary partitioning problem with metric value at
most δ. There is an algorithm that finds a partition with metric value at most
δ in

1. O(N^{2.5} + N^2 t_Q) time; the solution has at most 4B* tiles.

2. O(N^{1+ε} t_Q) time; the solution has at most 4B* · O(1/ε^2) tiles.

Again, by setting ε appropriately, we can obtain an algorithm with near-linear
running time.

Note. Using the ideas in HH], an O(N^c) time algorithm can be obtained that
approximates the arbitrary partition using at most twice as many buckets as the
optimum; however, the constant c is rather large (at least 5), and the resulting
algorithm is impractical for all but tiny values of N.

6 p x p Partitioning Schemes

In this section, we consider the p x p partitioning problem defined earlier.
Thus, we are given a two-dimensional distribution A, a metric E, and a value δ,
and we are interested in finding a minimum p and a p x p partitioning H such
that E(H) ≤ δ. Here, a p x p partitioning is determined by a set of horizontal
dividers (rows) h_0 = 0 < h_1 < ... < h_p = n and a set of vertical dividers
(columns) v_0 = 0 < v_1 < ... < v_p = n, and tile r_{i,j} of the partition
consists of all entries A[k, l] such that h_{i-1} < k ≤ h_i and
v_{j-1} < l ≤ v_j.
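
The divider-based definition can be made concrete with a short sketch that evaluates the MAX-SUM-ID metric of a p x p partitioning. The function name is hypothetical; dividers follow the convention above (h_0 = v_0 = 0, last divider n, entries 1-based), while the Python list A itself is 0-based.

```python
def pxp_max_sum_id(A, h, v):
    """MAX-SUM-ID metric of the partitioning given by dividers h and v.

    Tile r_{i,j} holds the entries A[k, l] with h[i-1] < k <= h[i] and
    v[j-1] < l <= v[j], using 1-based k and l as in the text.
    """
    worst = None
    for i in range(1, len(h)):
        for j in range(1, len(v)):
            tile_sum = sum(
                A[k - 1][l - 1]
                for k in range(h[i - 1] + 1, h[i] + 1)
                for l in range(v[j - 1] + 1, v[j] + 1)
            )
            worst = tile_sum if worst is None else max(worst, tile_sum)
    return worst


A = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
print(pxp_max_sum_id(A, [0, 2, 4], [0, 2, 4]))   # balanced 2 x 2 dividers
print(pxp_max_sum_id(A, [0, 3, 4], [0, 3, 4]))   # skewed 2 x 2 dividers
```

This direct summation costs time proportional to the array size per evaluation; with prefix sums each tile would cost O(1).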

We describe algorithms that run in linear or nearly linear time and that
compute solutions that are guaranteed to be within a small constant factor of
optimal. The algorithms provide an interesting application of the framework for
approximating Set Cover for set systems with bounded Vapnik-Chervonenkis
(VC) dimension described by Brönnimann and Goodrich 0 (see also the
discussion in Section 6.3).



6.1 NP-Hardness Results 

For the special case of the MAX-SUM-ID metric, Charikar, Chekuri, Feder, and 
Motwani jS| have shown that it is NP-hard to approximate the minimum heft 
of any p x p partitioning to within a factor of less than 2. This result can be 
extended to several other interesting metrics, including SUM-SUM-SQR_DIFF, 
MAX-MAX-AVG_DIFF, and MAX-MAX-GEO_DIFF. We point out that all the 
results are obtained by modifications of the hardness proof in 0, which uses a 
reduction from the k-Balanced Bipartite Vertex Cover (k-BBVC) problem. The
results are summarized in the following theorem, the proof of which is omitted 
for space constraints. 

Theorem 6. Given a data distribution A and an upper bound on p, it is NP- 
hard 

- to approximate the minimum heft of any p x p partitioning under the
MAX-MAX-AVG_DIFF and MAX-MAX-GEO_DIFF metrics to any factor less
than 2, and

- to find the minimum heft of any p x p partitioning under the
SUM-SUM-SQR_DIFF metric.

The proof of the first claim follows from some fairly simple modifications 
of the proof in jS|, while the second claim requires an additional accounting 
argument. Of course, the result also implies the NP-hardness of the problem of 
minimizing p given an upper bound on the heft, though we do not have any 
inapproximability result for that case. 



6.2 Preliminaries 

We denote by X the set of n rows and n columns of the n x n distribution A.
As before, we assume N = n^2. For each tile r_{i,j} of a p x p partitioning H, we
define a corresponding subset R_{i,j} of X consisting of all rows and columns that
intersect r_{i,j}, except for the last intersecting row and column. We also use a
weight function w, to be defined later, that assigns a real-valued weight w(x) to
each x ∈ X, and define w(Y) = Σ_{x ∈ Y} w(x) for any subset Y of X.

Definition 1. Given a weight function w, we say that a p x p partitioning H is
α-good if every tile r_{i,j} of H satisfies w(R_{i,j}) ≤ α · w(X).

We remark that our α-good partitionings correspond to the ε-nets used in
P] and originally introduced in P2|, which have found many applications in
computational geometry.

6.3 Upper Bounds for MAX Metrics

We now present approximation results for the p x p partitioning problem where
the cumulative metric is the MAX metric, i.e., the heft of a partition is the
largest heft of any of the tiles. We also require that the heft function is
monotonic, i.e., the heft of a tile does not decrease if we grow its size, which
is true for most interesting cases².

The Algorithm. Suppose that we are given a maximum value δ for the heft of
the solution, and assume that there exists a p_0 x p_0 partitioning H_0 with heft
at most δ (the value of p_0 will be guessed using binary search). We will show
that the following algorithm computes a p x p partitioning with heft at most δ
and p ≤ (2 + ε) · p_0, for any chosen ε > 0.

Algorithm MAX-pxp.

(1) Set the weights of all elements of X to 1.

(2) Repeat the following three steps:

(a) Compute an α-good partitioning H, for α = 1/((2 + ε) · p_0).

(b) Find a tile r_{i,j} in H whose heft is greater than δ. If none exists,
terminate and return H as solution.

(c) Multiply the weights of all elements of X that are contained in R_{i,j} by
β = (1 + ε/2).
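
The reweighting loop of MAX-pxp can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the heft is SUM-ID computed from prefix sums, the α-good partitioning is built with the greedy divider rule described in the proof of Lemma 4, p_0 is supplied by the caller (the binary search over p_0 is left out), and an iteration cap `max_iter` is added so that infeasible inputs return `None`. All function and parameter names are hypothetical. Rows and columns are divided independently, so the two divider lists may differ in length.

```python
def greedy_dividers(weights, alpha):
    """Place a divider as soon as the weight accumulated since the previous
    divider surpasses alpha times the total weight (0-based positions)."""
    total = sum(weights)
    dividers, acc = [0], 0.0
    for idx, w in enumerate(weights, start=1):
        acc += w
        if acc > alpha * total and idx < len(weights):
            dividers.append(idx)
            acc = 0.0
    dividers.append(len(weights))
    return dividers


def max_pxp(A, delta, p0, eps=0.5, max_iter=10_000):
    """Return divider lists (h, v) whose tiles all have SUM-ID heft <= delta,
    or None if the iteration cap is reached."""
    n = len(A)
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(n):
            P[r + 1][c + 1] = A[r][c] + P[r][c + 1] + P[r + 1][c] - P[r][c]

    def heft(r0, r1, c0, c1):
        # Rows r0..r1-1 and columns c0..c1-1 (0-based, half-open).
        return P[r1][c1] - P[r0][c1] - P[r1][c0] + P[r0][c0]

    row_w = [1.0] * n                      # step (1): all weights start at 1
    col_w = [1.0] * n
    alpha = 1.0 / ((2 + eps) * p0)
    beta = 1 + eps / 2
    for _ in range(max_iter):              # step (2)
        h = greedy_dividers(row_w, alpha)  # step (2a)
        v = greedy_dividers(col_w, alpha)
        bad = next(                        # step (2b): a tile that is too heavy
            ((i, j) for i in range(1, len(h)) for j in range(1, len(v))
             if heft(h[i - 1], h[i], v[j - 1], v[j]) > delta),
            None,
        )
        if bad is None:
            return h, v
        i, j = bad
        # Step (2c): reweight R_{i,j}, i.e. all rows and columns of the tile
        # except the last intersecting row and column.
        for r in range(h[i - 1], h[i] - 1):
            row_w[r] *= beta
        for c in range(v[j - 1], v[j] - 1):
            col_w[c] *= beta
    return None


A = [[0.0] * 8 for _ in range(8)]
for r in (0, 1):
    for c in (0, 1):
        A[r][c] = 10.0                     # a small hot spot in one corner
print(max_pxp(A, delta=10, p0=3))
```

In this example the first, evenly spaced partitioning contains a tile of heft 40; one reweighting round shifts the dividers toward the hot spot and the second partitioning already satisfies the bound.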

Analysis of the Algorithm. We now analyze the performance of the algorithm
in three steps: (1) we show how Step (2a) can be implemented, and bound the
size of the resulting α-good partitioning, (2) we bound the number of iterations
in Step (2), and (3) we analyze the running time of each iteration. Theorem 7
then gives the main result of this subsection.

Lemma 4. There exists an α-good p x p partitioning H with p = 1/α. H can be
computed in time O(n).

Proof: Simply set 1/α - 1 horizontal dividers one after the other, starting at
the top, and repeatedly choosing the next divider h_i as the first element where
the sum of the weights of all rows encountered after h_{i-1} surpasses
α · w(X_h), where X_h is the set of all rows. Choose the vertical dividers v_i in
an analogous fashion. □

Lemma 5. The loop in Step (2) of MAX-pxp terminates after O(p log n)
iterations.

Proof: The proof is similar to that of Lemma 3.4 of 0, which itself follows the
arguments in mm-

Note that the weight w(X) is initially 2n, and that it increases by at most
a factor of (1 + ε/(2 · (2 + ε) · p_0)) in each iteration, since in each iteration
we multiply the weights of exactly one of the sets R_{i,j} by a factor of
(1 + ε/2), and this set R_{i,j} has a total weight of at most
w(X)/((2 + ε) · p_0) by definition of α-goodness. Thus, after k iterations, we
can upper-bound w(X) by

$$
2n \left(1 + \frac{\epsilon}{2(2+\epsilon)\,p_0}\right)^{k}
\;\le\; 2n \cdot \exp\!\left(\frac{\epsilon k}{2(2+\epsilon)\,p_0}\right)
\;=\; \exp\!\left(\frac{\epsilon k}{2(2+\epsilon)\,p_0} + \ln(2n)\right),
$$

where exp() denotes the exponential function with base e. We now consider the
weight w(H_0) of the p_0 x p_0 partitioning H_0 of heft at most δ that we assume
to exist³. Note that any tile r_{i,j} that is selected in Step (2b) has a heft
larger than δ, and hence H_0 must cut r_{i,j}, due to the monotonicity of the heft
function. This implies that at least one element of H_0 is also contained in
R_{i,j} and has its weight increased by a factor of (1 + ε/2). Thus, we have

$$
w(H_0) \;=\; \sum_{x_i \in H_0} (1 + \epsilon/2)^{z_i}
$$

with Σ_i z_i ≥ k, where z_i denotes the number of times the weight of the
corresponding element x_i ∈ H_0 has been multiplied by (1 + ε/2). Using the
convexity of the exponential function with base (1 + ε/2), we can lower-bound
this as

$$
w(H_0) \;\ge\; 2p_0 \cdot (1 + \epsilon/2)^{k/(2p_0)}
\;=\; \exp\!\left(\frac{\ln(1+\epsilon/2)\cdot k}{2p_0} + \ln(2p_0)\right).
$$

Since H_0 is a subset of X, we must have w(H_0) ≤ w(X), which implies that

$$
\frac{\ln(1+\epsilon/2)\cdot k}{2p_0} + \ln(2p_0)
\;\le\; \frac{\epsilon k}{2(2+\epsilon)\,p_0} + \ln(2n),
$$

which, using a suitable lower bound on ln(1 + ε/2) for 0 < ε < 1, can be shown
to imply that

$$
k \;\le\; 2p_0 \cdot \frac{\ln(n/p_0)}{\ln(1+\epsilon/2) - \frac{\epsilon}{2+\epsilon}}.
$$

□

² An exception is the MAX-MAX-AVG_DIFF metric.



Lemma 6. Each iteration in Step (2) runs in time 0{n + p^ Hq), where tq is 
the time needed to compute the heft of a tile. 

Proof: Steps (2a) and (2c) clearly run in time 0{n). In Step (2b), we have to 
compute the heft of at most p^ tiles to find a tile with heft more than 6. □ 

Theorem 7. For any 6 and any e > 0, a p x p partitioning H with heft at most 
S and p < {‘2 + e)po can be computed in time 0{{n + p^ • tq) ■ plogn), where po 
is the minimum number such that there exists a po x pq partitioning with heft at 
most 6, and tq is the time needed to compute the heft of any tile. 

® The weight of Hq is the sum of the weights of the rows and columns that are hori- 
zontal dividers hi or vertical dividers Vi, respectively, of Hq. 
Proof: We perform binary search for the value of $p_0$, starting at $p_0 = 2$. We
run algorithm MAX-pxp for each $p_0$ in the search. If the algorithm does not
terminate after the number of iterations stated in Lemma 5, then we know that
there is no $p_0 \times p_0$ partitioning with heft at most $\delta$, and we increase $p_0$ by some
small factor $(1 + \epsilon')$. The total running time is dominated by the time used to
run MAX-pxp on the largest $p_0$; this implies the stated bound. □
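The outer search in this proof can be sketched as below. This is only an illustration under stated assumptions: `try_partition` is a hypothetical callback standing in for one run of MAX-pxp with the iteration cap from the lemma (returning `None` when the cap is exceeded), and the sketch shows only the geometric growth of the guess by a factor $(1 + \epsilon')$, not the algorithm itself.

```python
import math


def find_partitioning(try_partition, eps_prime=0.1, p_max=10**6):
    """Grow the guess p0 geometrically until try_partition succeeds.

    try_partition(p0) is a hypothetical oracle: it returns a partitioning
    if the capped run succeeds for this guess, and None otherwise.
    """
    p0 = 2
    while p0 <= p_max:
        result = try_partition(p0)
        if result is not None:
            return p0, result
        # Increase the guess by a factor (1 + eps_prime), by at least 1.
        p0 = max(p0 + 1, math.ceil(p0 * (1 + eps_prime)))
    return None


if __name__ == "__main__":
    # Toy oracle: pretend a partitioning exists once the guess reaches 7.
    print(find_partitioning(lambda p: "ok" if p >= 7 else None))
```

Because the guesses grow geometrically, the cost of the last (largest) guess dominates the total, which is the argument used in the proof.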

As explained before, in many cases the heft of each tile can be computed in
$O(1)$ time, by performing $O(N)$ steps of preprocessing. In particular, this is true
for the case of the MAX-SUM-ID metric, which is probably the most important
of the metrics covered by Theorem 7, and we get the following corollary. Note
that for the common case of p <IC ^/N , this gives a linear time bound. 
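The constant-time heft computation can be illustrated with two-dimensional prefix sums. The sketch below is not taken from the paper: it assumes that under MAX-SUM-ID the heft of a tile is the sum of its entries and the heft of a partitioning is the maximum over its tiles, and all function names are hypothetical.

```python
def build_prefix_sums(a):
    """O(N) preprocessing: prefix[i][j] = sum of a[0..i-1][0..j-1]."""
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    prefix = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(n_rows):
        row_total = 0
        for j in range(n_cols):
            row_total += a[i][j]
            prefix[i + 1][j + 1] = prefix[i][j + 1] + row_total
    return prefix


def tile_sum(prefix, r0, r1, c0, c1):
    """Sum of rows r0..r1-1 and columns c0..c1-1, in O(1) time."""
    return prefix[r1][c1] - prefix[r0][c1] - prefix[r1][c0] + prefix[r0][c0]


def max_sum_heft(prefix, row_cuts, col_cuts):
    """Largest tile sum of the partitioning given by the cut positions.

    row_cuts and col_cuts are increasing lists that start at 0 and end at
    the number of rows / columns (an assumed encoding of the dividers).
    """
    return max(
        tile_sum(prefix, r0, r1, c0, c1)
        for r0, r1 in zip(row_cuts, row_cuts[1:])
        for c0, c1 in zip(col_cuts, col_cuts[1:])
    )


if __name__ == "__main__":
    data = [[1, 2], [3, 4]]
    p = build_prefix_sums(data)
    print(tile_sum(p, 0, 2, 0, 2), max_sum_heft(p, [0, 1, 2], [0, 1, 2]))
```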

Corollary 1. For the MAX-SUM-ID metric, a $p \times p$ partitioning $H$ with heft
at most $\delta$ and $p \le (2 + \epsilon)p_0$ can be computed in time $O(N + p^3 \log N)$, where $p_0$
is the minimum number such that there exists a $p_0 \times p_0$ partitioning with heft at
most $\delta$.



Discussion. We now discuss the relation of our algorithm to the work in uni 
and 0. A simple reduction of the p x p partitioning problem to the Set Cover 
problem was given in m, resulting in an approximation ratio of 0(log N) using 
the well known greedy algorithm for Set Cover. This bound can be improved 
to O(logp) by using the algorithm for approximating Set Cover for the case of 
bounded VC-dimension in |3|, and observing that the set system generated by 
the reduction in uni has VC-dimension 4. By additionally using a construction 
of an e-net for this set system along the lines of our Lemma 0 one can obtain 
an approximation ratio of 16 and a running time of ■ p log TV). In order 

to get near-linear running time, we describe a modified algorithm that operates 
directly on the data distribution without materializing the set system used for 
the reduction to Set Cover, which could be of size in the worst case. 

The approximation ratio of $(2 + \epsilon)$ is then obtained by tightening the analysis
of 0 in several places. 

Approximating the Error. For the important special case of the MAX-SUM- 
ID metric, which arises when partitioning data or work evenly among the tiles, 
we can also get significantly improved results for the problem of approximating 
the minimal heft of any p x p partitioning given an upper bound on p. The best 
previous result in m achieved a running time of 0{N'^) and an approximation 
ratio of around 120. We can show the following results. 

Theorem 8. Let $\delta_0$ be the minimum heft of any $p_0 \times p_0$ partitioning. Then in
time $O(N + p^3 \log N)$, we can compute

(a) a $p \times p$ partitioning with heft $\delta \le 4\delta_0$ and $p \le (\frac{2}{3} + \epsilon)p_0$, and

(b) a $p \times p$ partitioning with heft $\delta \le 2\delta_0$ and $p \le (1 + \epsilon)p_0$,

for any chosen $\epsilon > 0$.
Note that these results come quite close to the lower bound of 2 on the
approximability shown by Charikar, Chekuri, Feder, and Motwani. These
results are based on a fairly simple observation: If we modify Algorithm MAX-pxp
such that in Step (2b) we search for a tile with heft at least $2\delta_0$ (resp., $4\delta_0$), then
we can conclude that the optimum solution $H_0$ must cut this tile at least 2
(resp., 3) times in order to get a heft of at most $\delta_0$. This means that Step (2c)
guarantees a larger increase in the weight of $H_0$. We can then adjust the choice
of the other parameters $\alpha$ and $\beta$ appropriately to get the result.

Other Extensions. All results in this subsection can also be easily extended
to $p \times q$ partitionings with $p \ne q$, and to non-square input distributions. In
particular, in the case of a $p \times q$ partitioning of an $n \times m$ data distribution,
the running time becomes $O((n + m + pq \cdot t_Q)(p + q) \log(n + m))$, while the
approximation ratio remains as before. It is also easy to extend the techniques
to $d$ dimensions, resulting in an approximation ratio of $d + \epsilon$ and a running time
of $O((n + p^d \cdot t_Q) \cdot p \log N)$ for the result in Theorem 7.

Input data in higher dimensions is usually sparse, and thus efficiency crucially 
depends on exploiting this sparseness. For the algorithms in this subsection, 
the easiest solution would be to implement the computation of the heft of a 
tile (represented by the term $t_Q$) in a way that exploits sparseness; the details
depend on the particular metric. 

6.4 Upper Bounds for SUM Metrics 

We now present approximation results for the case where the cumulative metric 
is the SUM metric. We again require that the heft function is monotonic. The 
algorithm follows the approach from the previous subsection. In contrast to the 
MAX case, we are not aware of any direct reduction of the SUM case to the Set 
Cover problem, and thus it is surprising that the same approach applies. 

The Algorithm. Suppose that we are given a maximum value $\delta_0$ for the heft of
the solution, and assume that there exists a $p_0 \times p_0$ partitioning $H_0$ with heft at
most $\delta_0$. Then the following algorithm computes a $p \times p$ partitioning with heft
at most $2\delta_0$ and $p \le (4 + \epsilon) \cdot p_0$, for any chosen $\epsilon > 0$.

Algorithm SUM-pxp.

(1) Set the weights of all elements of $X$ to 1.

(2) Repeat the following three steps:

(a) Compute an $\alpha$-good partitioning $H$, for $\alpha = \frac{1}{(4 + \epsilon)p_0}$.

(b) If the heft of the partitioning is at most $2\delta_0$, terminate and return $H$ as
solution. Otherwise, select a tile $r_{i,j}$ at random such that the probability
of picking a tile is proportional to its heft.

(c) Multiply the weights of all elements of $X$ that are contained in $R_{i,j}$ by
$\beta = (1 + \epsilon/2)$.
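Steps (2b) and (2c) can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the elements of $X$ are the rows and columns of the array, that $R_{i,j}$ consists of the rows and columns crossing the chosen tile, and that dividers are encoded as lists of cut positions; all names are hypothetical.

```python
import random


def pick_tile(hefts, rng=random):
    """Step (2b): pick a tile with probability proportional to its heft.

    hefts maps a tile index (i, j) to a nonnegative heft; at least one
    heft must be positive, which holds whenever the total exceeds 2*delta_0.
    """
    tiles = list(hefts)
    weights = [hefts[t] for t in tiles]
    return rng.choices(tiles, weights=weights, k=1)[0]


def boost_weights(row_w, col_w, tile, row_cuts, col_cuts, eps):
    """Step (2c): multiply the weights of rows/columns crossing the tile."""
    i, j = tile
    beta = 1 + eps / 2
    for r in range(row_cuts[i], row_cuts[i + 1]):
        row_w[r] *= beta
    for c in range(col_cuts[j], col_cuts[j + 1]):
        col_w[c] *= beta


if __name__ == "__main__":
    rows, cols = [1.0, 1.0, 1.0], [1.0, 1.0]
    chosen = pick_tile({(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.5, (1, 1): 0.5})
    boost_weights(rows, cols, chosen, [0, 2, 3], [0, 1, 2], eps=1.0)
    print(chosen, rows, cols)
```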

Sketch of Analysis. The following lemma provides the main insight underlying 
the analysis. 
Lemma 7. With probability at most $1/2$, the tile chosen in Step (2b) is not cut
by $H_0$.

Proof: Let $U$ be the set of tiles that are not cut by $H_0$. Then the hefts of the tiles
in $U$ sum up to at most $\delta_0$, since otherwise the monotonicity of the heft function
would imply that $H_0$ has a heft of more than $\delta_0$. Since the sum of the hefts of
all the tiles is at least $2\delta_0$, and each tile is chosen with probability proportional
to its heft, the probability of choosing a tile from $U$ is at most $\delta_0/(2\delta_0) = 1/2$. □

The lemma directly implies that the weight of $H_0$ is increased in Step (2c)
with probability at least $1/2$. This results in a slightly weaker lower bound for
$w(H_0)$ as compared to the MAX case in the previous subsection. To deal with
this weaker lower bound, we compute an $\alpha$-good partitioning with $\alpha = \frac{1}{(4 + \epsilon)p_0}$
instead of $\frac{1}{(2 + \epsilon)p_0}$. The remainder of the analysis is then along the lines of the
analysis for the MAX case, and we get the following result.

Theorem 9. For any $\delta_0$ and any $\epsilon > 0$, a $p \times p$ partitioning $H$ with heft at most
$2\delta_0$ and $p \le (4 + \epsilon)p_0$ can be computed in expected time $O((n + p^2 \cdot t_Q) \cdot p \log n)$,
where $p_0$ is the minimum number such that there exists a $p_0 \times p_0$ partitioning
with heft at most $\delta_0$, and $t_Q$ is the time needed to compute the heft of any tile.

Discussion and Extensions. We point out that the algorithm can be easily
made deterministic by modifying Steps (2b) and (2c) such that instead of choosing
one particular tile at random, we take every tile $r_{i,j}$ and increase the weight
of the elements in $R_{i,j}$ by a factor that is proportional to the heft of the tile. Also,
the algorithm can be generalized to yield a trade-off between the approximation
of the error and the approximation of the number of cuts.

As before, in many cases $t_Q$ can be implemented in $O(1)$ steps by performing
appropriate preprocessing. An interesting example is the SUM-SUM-SQR_DIFF
metric, which models the case when we wish to form tiles containing similar
values. Finally, the result can also be extended to $p \times q$ partitionings with $p \ne q$
and to higher dimensions, resulting in the same bounds as for the MAX case.

7 An Example Application of Results

We focus on one database application where partitioning problems arise; see [20]
for the implications of our results in other applications. 

Consider a database over relation R with n numerical attributes. This can be 
visualized as a multidimensional array A with one attribute along each dimension 
in which each array element contains the number of tuples in the database with 
the associated attribute values. This is the joint frequency distribution of the 
database. Histograms partition this distribution into rectangular regions (buckets)
and approximate each region using a small amount of space. Typically, the
frequencies in a bucket are approximated by their average. 

Histograms are typically used to estimate the result sizes of relational queries. 
The errors in the estimation depend mainly on the bucketization. A theory of 
optimal histograms has been developed, in which a number of histograms
have been identified as being optimal for various queries and operators. These
histograms, such as Equidepth and Equiwidth, can all be considered as special
cases of our partitioning problems, and our results apply to them uniformly.
Here, we focus on an important class of histograms, namely, the V-Optimal
histogram, which is provably the most accurate in several estimation problems.
Definition 2. V-Optimal Histogram: For a given number of buckets,
a V-Optimal histogram is the one with the number of buckets bounded by the 
specified threshold, but having the least variance, where variance is the sum of 
squared differences between the actual and approximate frequencies. Alternately, 
for a given total variance, the V-Optimal histogram is one with the least number 
of buckets with total variance bounded by the specified threshold. 
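The variance in this definition can be made concrete with a short sketch. It is an illustration under assumptions, not the paper's procedure: each bucket is taken to be a rectangle `(r0, r1, c0, c1)` with half-open row and column ranges, every frequency in a bucket is approximated by the bucket average, and the function names are hypothetical.

```python
def bucket_sse(freqs):
    """Sum of squared differences between frequencies and their average."""
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs)


def histogram_variance(a, buckets):
    """Total variance of a histogram over the 2-D frequency array a.

    buckets is a list of rectangles (r0, r1, c0, c1) covering rows
    r0..r1-1 and columns c0..c1-1; they are assumed to tile the array.
    """
    total = 0.0
    for r0, r1, c0, c1 in buckets:
        cells = [a[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        total += bucket_sse(cells)
    return total


if __name__ == "__main__":
    freq = [[1, 3], [5, 5]]
    # Two row-shaped buckets versus one bucket covering everything.
    print(histogram_variance(freq, [(0, 1, 0, 2), (1, 2, 0, 2)]))
    print(histogram_variance(freq, [(0, 2, 0, 2)]))
```

With two buckets the total variance is much smaller than with one, which is the trade-off between bucket count and variance that the two versions of the definition capture.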

Note that for joint distributions over two attributes, one version of the problem
of constructing the V-Optimal histogram is identical to the
partitioning problem with arbitrary partitions under the SUM-SUM-SQR_DIFF
metric. By applying our general results to this partitioning problem, we derive 
the following results (further details are in m-)-- 

1. Identifying the V-Optimal histogram with arbitrary buckets is NP-Hard.

2. The greedy MHIST algorithm presented in the literature [22] can result in
arbitrarily poor histograms in terms of the number of buckets, as well as the total
variance, whichever is being optimized; we can construct inputs to induce
such worst-case behavior. In fact, this applies to many other greedy solutions
we can design for this problem.

3. We can approximate the minimum number of buckets needed to achieve the
threshold variance in the V-Optimal histogram by using results in Section
6.4. The resulting algorithms work in near-linear time and produce small
factor approximations.

8 Conclusions 

We have considered the complexity of partitioning problems for different parti- 
tions and metrics. These problems are fundamental, and they arise in application 
scenarios such as histogram-based selectivity estimation, constructing grid files, 
load balancing, and many others. Very little is known about the complexity of 
these problems except for some special metrics, and heuristics with no proven 
guarantees on the quality of the solution are used. 

In this paper, we show that many natural versions of the partitioning prob- 
lem are NP-hard and thus it is unlikely they have efficient (polynomial time) 
exact solutions in the worst case. Our main results, however, are positive ones. 
We present highly efficient (near-linear time) algorithms that approximate the 
solutions to within small constant factors, for different partitions and metrics. 
We applied our general results to solving an important problem arising in
query result size estimation: the identification of V-Optimal histograms in two
dimensions. Existing greedy algorithms do not offer any quality guarantees for
this NP-Hard problem; our approximate solutions to the partitioning problems
imply the first known efficient algorithms for this problem with guarantees. We
are investigating its impact in practice.
Abstract. In this paper, we examine the complexity of multi-dimensional
range searching in non-replicating index structures. Such non-replicating
structures achieve low storage costs and fast update times
due to lack of multiple copies. We first obtain a lower bound for range
searching in non-replicating structures. Assuming a simple tree structure
model of an index, we prove that the worst-case time for a query retrieving
$t$ out of $n$ data items is $\Omega((n/b)^{(d-1)/d} + t/b)$, where $d$ is the data
dimensionality and $b$ is the capacity of index nodes. We then propose a
new index structure, called the O-tree, that achieves this query time in
dynamic environments. Updates are supported in $O(\log_b n)$ amortized
time and exact match queries in $O(\log_b n)$ worst-case time. This structure
improves the query time of the best known non-replicating structure,
the divided k-d tree, and is optimal for both queries and updates
in non-replicating tree structures.



1 Introduction 

Multi-dimensional range searching has important applications in geographic in- 
formation systems, image databases, constraint databases, CAD, and computer 
graphics. Over the past few decades, researchers have developed several data 
structures that support range searching. However, experiments have shown that 
these structures seldom scale with data dimensionality in practice. This led to 
theoretical investigations and several results on lower bound characterizations of 
range searching in the past few years. These results can be grouped into two cat- 
egories. The first category of results estimates the overhead of retrieving multi- 
dimensional data items in external memory environments without accounting 
for the overheads of locating (or identifying) the disk blocks where they reside.
The second category of results assumes a specific model for the index structure 
and the query navigation process. These results estimate the complete overhead 
for range searching within the assumed “index model”. We describe some of the
recent results in each class in turn and put our contributions in perspective. 

Hellerstein, Koutsoupias, and Papadimitriou [HKP97] investigate the complexity
of retrieving d-dimensional data items from secondary storage once the
desired data items are identified by an index structure. Assuming that data
items are stored and retrieved in blocks of $b$ items each in external memory,
they investigate the relationship between the storage redundancy ($r$) of an
indexing scheme, and the worst-case access overhead ($a$) for a range query. The
storage redundancy of an indexing scheme is measured as the ratio of the storage
occupied to the minimum storage required. The access overhead for a range
query that retrieves $t$ data items is defined as the ratio of the number of blocks
required to answer the query and the minimum number required (i.e., $t/b$).
(Note that the overhead of locating the data items is not included.) For the
case of a 2-dimensional rectangular grid, they prove that r = l7(log 5/a^ log a) 
and present a tight lower bound on the access overhead when r = 1. Recently, 
Samoladas and Miranker extend these results and establish a nice lower 

bound of r = l7(log6/loga) for a two-dimensional grid. Furthermore, for the 
d-dimensional case, they prove that r = l7((log 6/ log In a different vein, 
Koutsoupias and Taylor [KT98] consider a different instance of input workload.
Instead of a 2-d grid, they consider a 2-d grid rotated by the golden ratio; the
resulting workload is termed the Fibonacci workload. They show that for this
particular workload, the access overhead must be $b$ (highest possible) for any
storage redundancy that is less than $c \log n$, $c$ being a constant. Note that all the
above bounds are concerned with the accessing of disk blocks. We concentrate
on the indexing problem: how to locate the relevant disk blocks that contain the
objects of interest.



Complexity results that include locating overheads assuming a specific index 
structure model also exist in the literature. By modeling index structures as 
DAGs, Chazelle showed that a storage space of ^(^( lo'giQgn )^^"^) is essential 
to achieve a query time of $O(\mathrm{polylog}\, n + t)$. Note that the polylogarithmic
factor in query time indicates the overhead of locating data items in the index
structure. In the external-memory model (i.e., $b \gg 2$), Sairam and Subramanian
extend this result to 2-dimensions and propose the P-range tree, which 
achieves a query time of $O(\log_b n + t/b)$ for any constant $c > 0$ using
^(F iog°o^ ri ) storage for n 2-dimensional data. 



In this paper, we examine the worst-case overheads of locating and retrieving
data items for range queries using non-replicating tree structures. This model
requires low storage space and allows for efficient updates. It is representative
of most existing database structures such as Quad trees [Sam89], R-trees
[BKSS90, Gut84, BKK96], BANG files [Fre87, Fre95], K-D-B trees [Rob81], and
hB-trees. We prove that the worst-case range query overhead in such
non-replicating tree structures is bounded below by $\Omega((n/b)^{(d-1)/d})$ for $n$
d-dimensional data. This result extends Mehlhorn’s result [Meh84] from a ternary
decision-tree model to arbitrary $b$-ary trees. (Note that the results on retrieval
overhead discussed above also yield a partial query overhead (assuming
$r = 1$); this needs to be added to the indexing overhead to obtain the total
query overhead. In our model, we obtain a single term for the total query overhead.)
Our lower bound result explains why popular database tree structures
such as R*-trees, SS-trees, and SR-trees do not scale
with data dimensionality.

After presenting our lower-bound results, we propose a new non-replicating
tree structure, called the O-tree (O for optimal), that achieves the proposed lower
bound in dynamic environments. Note that in static environments, this bound is
met by Bentley’s k-d trees [Ben75]. In dynamic environments, the O-tree removes
a logarithmic factor from the overhead of the best known non-replicating tree
structure, the divided k-d tree [vKO91]. In contrast, popular structures such as
R-trees and Bang files have linear overheads (i.e., a query time of $O(n)$ for $n$
data items) as shown in EESH- 

2 Related Work on Non-replicating Structures 

Table 1 describes the related work on replicating index structures and summarizes
our contributions. The complexity results are expressed for $n$ d-dimensional
data items, assuming a node capacity of $b$ for the data structures. Query
complexities also indicate the impact of the number of data items retrieved, $t$. For
non-replicating index structures, we first list the Pseudo Quad and k-d trees,
which are defined using quad and balanced binary trees (i.e., $b \le 4$). They
achieve a query time of $O(n^{(d-1)(1+\epsilon)/d} + t)$ and update time of $O(\log^2 n)$. The
divided k-d tree reduces the query time overhead to $O(n^{(d-1)/d} \log^{1/d} n)$ and
update time overhead to $O(\log n)$. The Pseudo Quad and k-d trees assume a
Quad-tree or balanced binary tree representation for the data structures. Our
results are more general and are defined for arbitrary node capacities. In particular,
we prove that the query time in non-replicating structures is bounded below
by $\Omega((n/b)^{(d-1)/d} + t/b)$. The O-tree presented in this paper achieves this query
complexity while supporting updates in $O(\log_b n)$ time. This structure is optimal
for both queries and updates in non-replicating tree structures. Note that when
the node capacities in an O-tree are restricted to 3 (as in 2-3 trees), the query
complexity of the O-tree is better by a factor of $O(\log^{1/d} n)$ with respect to the
best known non-replicating structure, the divided k-d tree.

The rest of the paper is organized as follows: In Section 3, we describe our
model for non-replicating data structures and obtain a lower bound for range
query time complexity in these structures. In Section 4, we examine how to
extend this result to dynamic environments. We propose a new structure, called
the O-tree, for this purpose. This structure is optimal for both queries and
updates in non-replicating tree structures. The final section summarizes these
results. Appendix A gives details of the lower bound results.

3 Lower Bound for Range Query Time 

In this section, we describe our model of external-memory non-replicating data 
structures. We then obtain a lower bound for answering range queries in such 
non-replicating structures. 
| Capacity $b$ | Non-Replicating Structure | Query Time | Insert/Delete Time |
|---|---|---|---|
| $b \le 4$ | Lower Bound [Meh84] | $\Omega(n^{(d-1)/d} + t)$ | $\Omega(\log n)$ |
| $b \le 4$ | Pseudo k-d tree | $O(n^{(d-1)(1+\epsilon)/d} + t)$ | $O(\log^2 n)$ |
| $b \le 4$ | Divided k-d tree [vKO91] | $O(n^{(d-1)/d} \log^{1/d} n + t)$ | $O(\log n)$ |
| Arbitrary $b$ | Lower Bound [this paper] | $\Omega((n/b)^{(d-1)/d} + t/b)$ | $\Omega(\log_b n)$ |
| Arbitrary $b$ | O-tree [this paper] | $O((n/b)^{(d-1)/d} + t/b)$ | $O(\log_b n)$ |
| Arbitrary $b$ | R-trees, Bang Files [Gut84, Fre87] | $O((n/b) + t/b)$ | $O(\log_b n)$ |

Table 1. Comparison of index structures for dynamic range searching: The
results of this paper are highlighted.



3.1 Model of a Non-replicating Index Structure 

We model an index structure as a rooted tree, in which each node is of finite size, 
and is associated with a search predicate. The search predicate for d dimensions 
is assumed to be a conjunction of at most 2d binary comparisons. We denote 
the search predicate for a vertex v as cond(v). Each comparison in the search 
predicate involves an endpoint of a query (i.e., the start or the end of a query 
along a specific dimension) and evaluates to true only if a specified query satisfies 
all the comparisons. For example, a 2-dimensional search predicate involving such 
comparisons may be specified as (gf > 1) A ((gf < 4) A (g| > 0) A (g| > 10). We 
refer to the conjunction of comparisons, which involve the query endpoints in 
a specific dimension i, by conjunct Ci. For instance, for the example predicate 
the conjunct C\ refers to (gf > 1) A ((gf < 4) and the conjunct C 2 refers to 
(g| > 0) A (g| > 10). In addition to the search predicate, an index node also 
stores at most $b$ data items, where $b$ is a constant, and a finite number (also
assumed to be b) of child pointers. Each data item is stored in exactly one node 
of the index (non-replicating property). When presented with a query, an index 
node returns a partial result set and forwards the query to a subset of all children, 
whose search predicates are satisfied by the query. At the leaf level, index nodes 
do not have any children; the response to a presented query is just a partial 
result set. The result of a query is obtained by the union of the result sets of all 
visited index nodes. 
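The model above can be sketched as a small data structure. This is a simplified illustration under assumptions, not the paper's formal model: the search predicate of each node is taken to be an intersection test against a bounding rectangle (a conjunction of $2d$ comparisons with the query endpoints), the root is always visited, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    rect: tuple                                   # ((lo_1, hi_1), ..., (lo_d, hi_d))
    items: list = field(default_factory=list)     # at most b data points
    children: list = field(default_factory=list)  # at most b child nodes


def satisfies(query, rect):
    """cond(v): 2d comparisons between query endpoints and rectangle bounds."""
    return all(q_lo <= r_hi and r_lo <= q_hi
               for (q_lo, q_hi), (r_lo, r_hi) in zip(query, rect))


def in_range(query, point):
    return all(lo <= x <= hi for (lo, hi), x in zip(query, point))


def range_query(root, query):
    """Return (result set, number of nodes visited) for a range query."""
    result, visited, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        visited += 1
        # Each visited node contributes a partial result set.
        result.extend(p for p in node.items if in_range(query, p))
        # Forward the query only to children whose predicate it satisfies.
        stack.extend(c for c in node.children if satisfies(query, c.rect))
    return result, visited


if __name__ == "__main__":
    left = Node(rect=((0, 1), (0, 1)), items=[(0, 0), (1, 1)])
    right = Node(rect=((5, 6), (5, 6)), items=[(5, 5), (6, 6)])
    root = Node(rect=((0, 6), (0, 6)), children=[left, right])
    print(range_query(root, ((0, 2), (0, 2))))
```

The visited-node count returned here corresponds to the query cost measure used in the complexity analysis.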
The above model for an index structure is similar to the tree model of Vaidya.
However, there are two differences: (1) Search predicates are represented
as a conjunction of comparisons rather than as a disjunction of comparisons.
(2) The data associated with an index node is bounded. This simplifies the
model, since the cost of accessing an index node is no longer proportional to the
associated number of data items; instead, it is a constant. This model is a direct
extension of Mehlhorn’s decision tree model from ternary trees to arbitrary
$b$-ary trees, assuming each node has at most $b$ children. Note that this tree model
is more restrictive than Chazelle’s model, where data structures are
represented as DAGs, or the more general “pointer-machine” model of Tarjan
[Tar79]. Non-replication is hard to enforce in a graph model since multiple paths
to the same data items are possible without actually replicating the data items
themselves. Therefore, we focus on a tree structure model, which captures the
basic limitations of external-memory data structures such as finite node sizes,
and models most existing data structures. This model for non-replicating index
structures excludes threaded search trees where data items can be reached via
multiple paths in the tree, for instance the Cross tree of Grossi and Italiano.

Next, we describe complexity measures for comparing data structures in this 
model. The space complexity of a data structure is determined by the total 
size for leaf and non-leaf nodes in the index structure. Query and update time 
complexities are estimated using the number of nodes visited in the index during 
a query and an update respectively. For queries, we measure the worst-case 
complexity. For updates, which are much less frequent, we measure the amortized 
time complexity over a sequence of update operations. 

3.2 Lower Bound 

In this subsection, we obtain a lower bound on range query time complexity 
in non-replicating structures. Our lower bound extends the result of Mehlhorn
[Meh84]. His proof does not consider $b$, the node capacity of index nodes. As
a result, the ensuing lower bound is independent of b. For realistic external- 
memory devices, the value of b could be quite large and cannot be ignored in 
complexity analysis. Our lower bound captures the exact dependence of range 
query time complexity on node capacity of an index structure along with the 
number of points and their dimensionality. The proof proceeds by considering a 
specific data set and is straightforward. 

Since a lower bound in a restricted domain also holds for more general ones,
we restrict the data points to have only integer values for their coordinates.
Without loss of generality, the constants in edge conditions are also assumed
to be integer values. To obtain a lower bound for range query time complexity
in non-replicating index structures, we work with the following layout of $n$
d-dimensional points. For simplicity, let $n = k^d$.

$$S_d^n = \{(2a_1 - 1, 2a_2 - 1, \ldots, 2a_i - 1, \ldots, 2a_d - 1) \mid a_i \in \{1, \ldots, k\},\ \forall i \in \{1, \ldots, d\}\}.$$

This set $S_d^n$ forms a d-dimensional grid of length $k = n^{1/d}$ in each dimension
with a 2-unit separation between every two adjacent vertices. We omit the
superscript $n$, wherever possible, for clarity. For any subset $S$ of $S_d$, we define
an $i$-projection set $P_i(S)$ as the set of $X_i$-coordinates of all points in $S$, i.e.,
$P_i(S) = \{p_i \mid (p_1, \ldots, p_d) \in S\}$. Such a projection set has a gap at $x$ if

- $x$ lies in the range of $P_i(S)$ (i.e., $\exists a, b \in P_i(S) : a < x < b$), and,

- $x$ is even.

We concentrate on the following set of queries $Q = \bigcup_{1 \le i \le d} Q_i$, where $Q_i$ is
the set of queries with the following specification:

- The interval along the dimension $X_i$ is $[2c, 2c]$ for some $c$: $1 \le c < k$, and,

- Intervals in all other dimensions are $[1, 2k + 1]$.

Note that none of the queries in $Q$ retrieve any data from the set $S_d$. Hence,
they are referred to as hole queries in general, and as $i$-hole queries if they are
specifically from $Q_i$. Also note that the projection of an $i$-hole query to the
dimension $X_i$ corresponds to a gap. Since the number of gaps in each dimension
is $n^{1/d} - 1$, the total number of holes in all the $d$ dimensions is $d(n^{1/d} - 1)$.
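The grid and its hole queries are easy to enumerate for small parameters. The sketch below is an illustration with hypothetical function names; it assumes $c$ ranges over $1, \ldots, k-1$, which matches the count of $k - 1$ gaps per dimension, and it checks that no hole query retrieves a grid point.

```python
from itertools import product


def grid_points(k, d):
    """All points (2*a_1 - 1, ..., 2*a_d - 1) with each a_i in 1..k."""
    return [tuple(2 * a - 1 for a in idx)
            for idx in product(range(1, k + 1), repeat=d)]


def hole_queries(k, d):
    """One query per dimension i and even coordinate 2c, 1 <= c < k."""
    queries = []
    for i in range(d):
        for c in range(1, k):
            q = [(1, 2 * k + 1)] * d   # full range in all other dimensions
            q[i] = (2 * c, 2 * c)      # degenerate interval at an even value
            queries.append(tuple(q))
    return queries


def retrieved(query, points):
    return [p for p in points
            if all(lo <= x <= hi for (lo, hi), x in zip(query, p))]


if __name__ == "__main__":
    k, d = 3, 2
    pts, holes = grid_points(k, d), hole_queries(k, d)
    print(len(pts), len(holes), all(not retrieved(q, pts) for q in holes))
```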

Using these observations, we show that there exists a hole query in any
non-replicating index structure for $S_d$ that has a query complexity of $\Omega((n/b)^{(d-1)/d})$.

Theorem 1 The worst-case time complexity of a range query retrieving $t$ out of
$n$ d-dimensional points in a non-replicating index structure is $\Omega((n/b)^{(d-1)/d} + t/b)$.

Proof: Let $n = k^d$. Let $T$ be any non-replicating index structure for the set of
points $S_d^n$. Then from the lemma in Appendix A, the total query cost in $T$ for all
the holes is at least $(n - o(n))/b^{(d-1)/d}$. Since the number of hole queries is at
most $d(n^{1/d} - 1)$, by the pigeonhole principle it follows that there is a hole whose
query overhead is $\Omega((n/b)^{(d-1)/d})$. Combining it with the minimum cost $t/b$ for
accessing $t$ data items stored in $b$-ary nodes, we obtain the desired bound. □

4 Optimal Non-replicating Structure for Dynamic 
Environments 

In this section, we describe the O-tree, an optimal non-replicating structure for
dynamic environments. We first introduce some preliminary concepts for organizing
2-dimensional data in Section 4.1. We then describe an intermediate structure
for 2-dimensional data, called the C-tree, that supports range queries in optimal
time and insertions and deletions in $O(\log_b n \log\log_b n)$ time, in Section 4.2.
In Section 4.3, we refine the C-trees to O-trees to reduce insertion/deletion
times to $O(\log n)$. These results are then extended to an arbitrary number of
dimensions.

4.1 Basic Concepts 

Notation. A 1-dimensional interval $I$ is specified by a starting point (denoted
$I^s$) and an ending point (denoted $I^e$) in a linear domain. Interval $I$ is contained
in interval $J$ if $((I^s \ge J^s) \land (I^e \le J^e))$. Interval $I$ intersects interval $J$ if
$((I^s \le J^e) \land (J^s \le I^e))$. Note that containment is a special case of intersection. A
2-dimensional interval object $I$ is specified by two 1-dimensional intervals, one in
the X-dimension and the other in the Y-dimension. This object $I$ is contained
in another 2-dimensional object $J$ if the containment holds in each dimension.
Similarly, it intersects $J$ if the intersection holds in each dimension. We say
a 2-dimensional interval object $I$ is contained in another object $J$ in a specific
dimension if the interval of $I$ in that dimension is contained in the corresponding
interval of $J$. These definitions can be easily generalized to arbitrary dimensions.
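These definitions translate directly into predicates. The sketch below is illustrative only: intervals are represented as `(start, end)` pairs with closed endpoints, a multi-dimensional object is a tuple of such pairs, and the function names are hypothetical.

```python
def contained(i, j):
    """Interval i = (start, end) lies inside interval j."""
    return j[0] <= i[0] and i[1] <= j[1]


def intersects(i, j):
    """Intervals i and j share at least one point."""
    return i[0] <= j[1] and j[0] <= i[1]


def contained_nd(a, b):
    """Object a is contained in object b if containment holds per dimension."""
    return all(contained(ia, ib) for ia, ib in zip(a, b))


def intersects_nd(a, b):
    """Object a intersects object b if intersection holds per dimension."""
    return all(intersects(ia, ib) for ia, ib in zip(a, b))


if __name__ == "__main__":
    inner = ((2, 3), (2, 3))
    outer = ((1, 4), (1, 4))
    far = ((5, 6), (2, 3))
    print(contained_nd(inner, outer), intersects_nd(inner, outer),
          intersects_nd(far, outer))
```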

Partitioning 2-dimensional Data. Consider a set $S$ of 2-dimensional points.
These points can be ordered using their X-coordinates. If the X-coordinates
match, the points can be ordered using their Y-coordinates. Alternately, they can
be ordered using their Y-coordinates first, and then using their X-coordinates.
Let the former ordering be referred to as the X-order and the latter as the
Y-order. We denote that point $p_1$ precedes point $p_2$ in X-order by $p_1 \prec_x p_2$.
Likewise, $p_1 \prec_y p_2$ denotes the precedence of point $p_1$ over $p_2$ in Y-order. Let $S_\sigma$
denote the sequence of points in $S$ sorted using $\sigma$-order, where $\sigma$ denotes either the
X- or the Y-dimension. The extent of such a sequence can be represented by the
first and last points in the sequence. This extent denotes the extent of the set $S$ in
$\sigma$-order. An alternative but not equivalent representation for the extent of the set
$S$ is by its bounding rectangle, denoted by $b(S)$. This bounding rectangle has an
interval for each dimension. The interval in the X-dimension is denoted by $b_x(S)$ and
is obtained by taking the minimum and maximum values of the X-coordinates of
the points in the set $S$. Likewise, the interval in the Y-dimension is denoted as $b_y(S)$
and is obtained using the Y-coordinates. To illustrate these concepts, consider
an example: Let $S' = \{(1, 1), (2, 6), (2, 4), (3, 5), (5, 1), (6, 6), (6, 4), (7, 5)\}$. The
X-order orders the points in $S'$ using the X-coordinates. Points that have the same
X-coordinates, for instance (2, 4) and (2, 6), are ordered using the Y-coordinates.
The extent of the set $S'$ is the first and last points in this sequence, which
are (1, 1) and (7, 5). The minimum and maximum X-coordinates for points in
$S'$ are 1 and 7. Therefore, the X-interval of the bounding rectangle for $S'$ is
$b_x(S') = [1, 7]$. Similarly, the minimum and maximum Y-coordinates are 1 and
6. Therefore, the Y-interval is $b_y(S') = [1, 6]$.
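The example can be checked mechanically. The following sketch uses hypothetical helper names and relies on the fact that Python compares tuples lexicographically, which coincides with the X-order on `(x, y)` pairs.

```python
S_PRIME = [(1, 1), (2, 6), (2, 4), (3, 5), (5, 1), (6, 6), (6, 4), (7, 5)]


def x_order(points):
    """Sort by X-coordinate, breaking ties by Y-coordinate."""
    return sorted(points)


def y_order(points):
    """Sort by Y-coordinate, breaking ties by X-coordinate."""
    return sorted(points, key=lambda p: (p[1], p[0]))


def extent(points, order):
    """First and last point of the set in the given order."""
    seq = order(points)
    return seq[0], seq[-1]


def bounding_rectangle(points):
    """Return ((min_x, max_x), (min_y, max_y))."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), max(xs)), (min(ys), max(ys))


if __name__ == "__main__":
    print(x_order(S_PRIME))
    print(extent(S_PRIME, x_order))
    print(bounding_rectangle(S_PRIME))
```

Note that the extent in X-order, (1, 1) to (7, 5), differs from the bounding rectangle, which is why the text calls the two representations not equivalent.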

Total orders, such as $\prec_x$ and $\prec_y$ above, can be used to partition a set $S$ into
smaller subsets that are disjoint and are of approximately the same size. We refer
to such subsets as partitions of $S$ in the specified order. For example, the set $S'$
can be divided into 4 partitions, $P_1, \ldots, P_4$, using X-order as:

$P_1 = \{(1, 1), (2, 4)\}$, $P_2 = \{(2, 6), (3, 5)\}$,

$P_3 = \{(5, 1), (6, 4)\}$, $P_4 = \{(6, 6), (7, 5)\}$.
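A minimal sketch of such a partitioning is given below. It assumes that the set is simply cut into contiguous runs of the sorted sequence whose sizes differ by at most one; the function names are hypothetical, and the default sort on `(x, y)` tuples yields the X-order.

```python
def partition_in_order(points, num_parts, key=None):
    """Split the sorted sequence into num_parts contiguous, near-equal runs."""
    seq = sorted(points, key=key)
    base, extra = divmod(len(seq), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        parts.append(seq[start:start + size])
        start += size
    return parts


def extents(parts):
    """Extent (first point, last point) of every non-empty partition."""
    return [(p[0], p[-1]) for p in parts if p]


if __name__ == "__main__":
    s_prime = [(1, 1), (2, 6), (2, 4), (3, 5), (5, 1), (6, 6), (6, 4), (7, 5)]
    parts = partition_in_order(s_prime, 4)
    print(parts)
    print(extents(parts))
```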

A total order on individual points can be extended to the extents of partitions
of a set S. Partition extent (a, b) precedes another partition extent (c, d) in
$\sigma$-order if b precedes c in $\sigma$-order.
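A minimal sketch of partitioning in a given order, and of the induced precedence between partition extents, is shown below. The helper names are hypothetical, and the chunking rule (equal-sized consecutive runs of the sorted sequence) is just one simple way to obtain partitions of approximately the same size.

```python
# Illustrative sketch; names and the chunking rule are assumptions.

def partition(points, k, key):
    """Split the points, sorted by `key`, into consecutive chunks.

    Produces at most k chunks of (almost) equal size.
    """
    seq = sorted(points, key=key)
    size = -(-len(seq) // k)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def extent_precedes(ext_a, ext_b, key):
    """Extent (a, b) precedes extent (c, d) if b precedes c in the order."""
    return key(ext_a[1]) < key(ext_b[0])

def x_key(p):
    """X-order: by X-coordinate, ties broken by Y-coordinate."""
    return (p[0], p[1])

S_PRIME = [(1, 1), (2, 6), (2, 4), (3, 5), (5, 1), (6, 6), (6, 4), (7, 5)]
PARTS = partition(S_PRIME, 4, x_key)
EXTENTS = [(part[0], part[-1]) for part in PARTS]
```

For the set S' and k = 4 this yields the four partitions listed in the text, and consecutive extents precede one another in X-order.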

B-tree for Partitions of 2-dimensional Data. Let P denote the set of partitions
for a set of 2-dimensional points obtained using $\sigma$-order. Since the
extents of the partitions in P can be totally ordered using $\sigma$-order, a
B-tree (the variant that stores data in the leaves) can be constructed over the
partitions using these extents. We refer to this tree as a B-tree for the
partitions in $\sigma$-order, and denote it by $B(|P|, \prec_\sigma)$. Figure 1
shows such a tree for the partitions of S'. Such a tree facilitates efficient
location of necessary partitions. For instance, identifying a partition in which
a new point s is to be stored can be accomplished as follows. We start at the
root of the B-tree and traverse down the tree by choosing the
i-th entry (and the corresponding subtree) such that
$p_1^i \prec_\sigma s \prec_\sigma p_1^{i+1}$, where $(p_1^i, p_2^i)$ and
$(p_1^{i+1}, p_2^{i+1})$ are the extents associated with the i-th and (i+1)-st
entries. For example, for inserting a new point (4, 1), we choose the left
subtree at the root and then the partition $P_2$ at vertex v. In general, for a
tree of k partitions, this procedure takes log k time.
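The partition lookup can be sketched with a binary search over a flat, sorted list of partition extents. This is a simplification: a flat array stands in for the B-tree, the function name is hypothetical, and the sketch assumes that the first point of each extent acts as the separator between neighbouring entries.

```python
from bisect import bisect_right

# Illustrative sketch; a sorted list stands in for the B-tree levels.

def locate_partition(first_points, s, key):
    """Index of the partition in which point s would be stored.

    `first_points` holds the first point of each partition extent, in order.
    Picks the last partition whose first point does not follow s; points that
    precede every partition fall into the first one.
    """
    keys = [key(p) for p in first_points]
    i = bisect_right(keys, key(s)) - 1
    return max(i, 0)

def x_key(p):
    """X-order: by X-coordinate, ties broken by Y-coordinate."""
    return (p[0], p[1])

# First points of the extents of P1..P4 for the example set S'.
FIRST_POINTS = [(1, 1), (2, 6), (5, 1), (6, 6)]
```

With these extents, the point (4, 1) maps to index 1, that is, to partition P2, which matches the example in the text. The binary search mirrors the logarithmic descent through the tree.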



Fig. 1. B-tree $B(4, \prec_X)$ for 4 partitions of S' using X-order.



In addition to the partition extent information used during updates, each
index entry is associated with a bounding rectangle to aid in the navigation of
queries. The intersection of an arbitrary query with the bounding rectangle
represents the edge condition associated with an index entry. Given this
structure, a query $q = (q_X, q_Y)$ at vertex u accesses a child vertex v only
if it intersects the bounding rectangle for vertex v, whose interval in
dimension i is denoted by $[l_i, r_i]$ (i.e., if the query interval
$q_i = [q_i^l, q_i^r]$ satisfies the condition
$(q_i^l \le r_i) \wedge (l_i \le q_i^r)$ for each dimension $i \in \{X, Y\}$
associated with the edge (u, v)). For example, in Figure 1, a query at the root
vertex accesses vertex v only if it intersects the bounding rectangle
([1,3], [1,6]) associated with the entry for v in the root vertex. Since
checking for intersection involves a conjunction of binary comparisons, this
structure satisfies the model introduced earlier. Before proceeding further, we
state the following property for a B-tree on partitions constructed as above.
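The edge condition is a conjunction of two comparisons per dimension. A minimal sketch, with hypothetical names and with intervals represented as (low, high) pairs, could look as follows.

```python
# Illustrative sketch; the representation of queries and rectangles as
# tuples of closed (low, high) intervals is an assumption.

def intervals_intersect(query_interval, rect_interval):
    """Two closed intervals intersect iff each starts before the other ends."""
    q_lo, q_hi = query_interval
    r_lo, r_hi = rect_interval
    return q_lo <= r_hi and r_lo <= q_hi

def satisfies_edge_condition(query, rect):
    """A query accesses a child only if it intersects the child's bounding
    rectangle in every dimension."""
    return all(intervals_intersect(q, r) for q, r in zip(query, rect))

# Bounding rectangle ([1,3], [1,6]) from the example in the text.
RECT_V = ((1, 3), (1, 6))
```

For instance, the query ([2,5], [0,2]) passes the test for this rectangle, while ([4,5], [1,6]) fails in the X-dimension and would not descend into v.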

Lemma 1. For any query q and any dimension $\sigma$, at most 2 partitions that
intersect the query are not contained in the query in dimension $\sigma$.

4.2 C-tree: An Optimal Structure for Queries 

Let us now examine how best to organize an arbitrary 2-dimensional data set
S while achieving good update times. Assume the data set is partitioned into k
subsets using Y-order. (The best value of k for a set S of n points will be
determined during the course of the discussion.) Figure 2(a) shows the
partitions $P_1, \ldots, P_k$ for the 2-dimensional set S. By maintaining a
B-tree for these partitions, we can locate partitions intersecting a query
$q = (q_X, q_Y)$ efficiently. Of these intersecting partitions, there are at
most two extreme partitions which are not contained in q in the Y-dimension.
Partitions $P_i$ and $P_j$ are such extreme partitions in Figure 2(a) for the
query q. All other intersecting partitions are contained in the query interval
$q_Y$ and hence are referred to as interior partitions. In Figure 2(a),
partitions $P_{i+1}$ to $P_{j-1}$ are interior partitions. In these partitions,
we have to search using the query interval $q_X$. This search can be
accomplished efficiently if the data in these partitions is ordered according to
their X-coordinates. Hence, each partition is subdivided into m sub-partitions
using X-order. The sub-partitions for each partition are organized as a B-tree.
Figure 2(b) shows the resulting organization. We have two layers of B-trees. The
top layer (referred to as the layer-2 tree) has k partitions using Y-order. Each
partition of this tree is organized as a layer-1 B-tree of m partitions using
X-order.

Range Queries. Let us examine how range queries are processed in the above
organization. As specified in the model introduced earlier, a query
$q = (q_X, q_Y)$ starts at the root of the layer-2 B-tree and accesses a vertex
v only if q intersects the associated bounding rectangle (i.e., satisfies the
corresponding edge condition). We trace the worst-case time complexity t for the
query as follows:




Fig. 2. (a) Partitions and sub-partitions for 2-dimensional data, (b) 2-layered 
tree over the sub-partitions. 

1. Find all the partitions intersecting query q in the layer-2 B-tree.

2. For each interior partition (the Y-interval is contained in $q_Y$), such as
$P_{i+1}, \ldots, P_{j-1}$ in Figure 2(a), search in the layer-1 B-tree as
follows:

(a) Find all sub-partitions that intersect q in the corresponding B-tree.

(b) For the interior sub-partitions (i.e., for the sub-partitions with
X-intervals contained in $q_X$), retrieve all associated data. All such data
form part of the query result.

(c) For the extreme sub-partitions, search in the sub-partitions.

3. For the extreme partitions, such as $P_i$ and $P_j$ in Figure 2(a),

(a) Find all sub-partitions that intersect q in the corresponding B-tree.

(b) Search in these sub-partitions.

Step 1 of the algorithm takes O(k) time, since the B-tree has at most k
partitions. This is because some partitions may intersect the query in the
Y-dimension and not in the X-dimension. Let $n_p \le k$ partitions intersect the
query in both dimensions. Of these, at most two can be extreme partitions, i.e.,
the Y-intervals of their bounding rectangles are not contained in the
corresponding query interval. Consequently, step 2 is executed $(n_p - 2)$
times, whereas step 3 is executed at most 2 times. For each of the $(n_p - 2)$
interior partitions, step 2(a) has overheads of log m in the corresponding
layer-1 B-tree. Note that in this case, logarithmic overheads are achieved as
opposed to the linear overheads in Step 1 for the following reasons: First, all
sub-partitions intersecting q are stored as a linear sequence using the X-order.
Second, all of these sub-partitions, being parts of an interior partition, are
contained in the query in the Y-dimension (i.e., $q_Y$ contains their
corresponding Y-intervals). Next, note that the interior sub-partitions are
contained in the query in both X- and Y-dimensions. Hence, all data points in
such sub-partitions are part of the query result. Assuming they are organized
using space linear in the number of points, the retrieval cost in these
sub-partitions is directly proportional to the query result. Hence, they do not
contribute to the query overheads. For the extreme sub-partitions in step 2(c),
we continue the search in the sub-partition using the query q. Since each of
these sub-partitions has O(n/(km)) data points, let us denote the associated
overheads by ST(n/(km)). This cost is also borne by the sub-partitions of the
extreme partitions (step 3(b)). Note that the cost of finding these
sub-partitions in each extreme partition is O(m). Summing up, the total overhead
t is:

$$t = k + (k - 2)\,[\log m + 2\,ST(n/(km))] + 2\,[m + m\,ST(n/(km))]$$

$$\;\; = k + (k - 2)\log m + 2m + [2(k - 2) + 2m]\,ST(n/(km)).$$

By organizing the sub-partitions in different ways and finding the optimal
values for k and m, we achieve different structures. We consider three such
organizations for the sub-partitions. The first two will be discussed in this
subsection. The third yields the O-tree and is described in Section 4.3.

Assuming that the search in the sub-partitions involves a linear scan of all
the data points (i.e., $ST(l) = l$), the above equation yields a minimum value
of $O(\sqrt{n \log n})$ for t, when $k = m = O(\sqrt{n/\log n})$. The same
overhead is also obtained when $k = O(\sqrt{n/\log n})$ and
$m = O(\sqrt{n \log n})$. This latter organization corresponds to the divided
k-d tree of van Kreveld and Overmars [vKO91].

Consider an alternative organization for the data in sub-partitions of
Figure 2(b) to reduce the search costs. Assume each sub-partition is organized
as a static k-d tree |R,a,v98) . Then the search cost in a sub-partition of size
$c = O(n/(km))$ reduces from c in the divided k-d tree to $\sqrt{c}$.
Consequently, the query overhead becomes

$$t = k + (k - 2)\log m + 2m + [2(k - 2) + 2m] \cdot O(\sqrt{n/(km)}).$$

This equation has a minimum value of $O(\sqrt{n})$ when
$k = m = O(\sqrt{n}/\log n)$. Note that this query overhead corresponds to the
lower bound (for 2-dimensional data) obtained earlier. Hence, this structure is
optimal for range queries. We refer to this structure as the combined k-d tree,
C-tree for short, since it combines the divided k-d tree with the static k-d
trees. As in the divided k-d tree [vKO91], the C-tree has 2 layers of B-trees,
each organized on a different dimension. However, at both the layers, every
B-tree has $k = O(\sqrt{n}/\log n)$ partitions. The exact value of k will be
specified later. We refer to these two layers as a 2-layered B-tree of k
partitions. In addition, the C-tree has a third layer that is organized as
static k-d trees. We refer to the trees in the third layer as the leaf
partitions of the 2-layered B-tree.
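To get a feel for the two parameter choices, the overhead expression can be evaluated numerically. The sketch below is only illustrative: it drops the constants hidden in the O-notation, assumes base-2 logarithms, and uses hypothetical function names.

```python
import math

# Illustrative sketch; constants hidden by O-notation are ignored and
# logarithms are taken to base 2 (an assumption).

def query_overhead(n, k, m, search_cost):
    """t = k + (k-2) log m + 2m + [2(k-2) + 2m] * ST(n/(km))."""
    c = search_cost(n / (k * m))
    return k + (k - 2) * math.log2(m) + 2 * m + (2 * (k - 2) + 2 * m) * c

def divided_kd_overhead(n):
    """Linear scan in sub-partitions, k = m = sqrt(n / log n)."""
    k = m = math.sqrt(n / math.log2(n))
    return query_overhead(n, k, m, lambda c: c)

def c_tree_overhead(n):
    """Static k-d trees in sub-partitions, k = m = sqrt(n) / log n."""
    k = m = math.sqrt(n) / math.log2(n)
    return query_overhead(n, k, m, math.sqrt)
```

Calling both functions for a moderately large n (for example n = 2**20) shows the C-tree overhead staying within a small constant factor of sqrt(n), while the linear-scan variant stays within a small constant factor of sqrt(n log n).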

C-tree for n 2-dimensional points
    = 2-layered B-tree of $O(\sqrt{n}/\log n)$ partitions
    + 2-dimensional k-d tree of $O(\log^2 n)$ points.

Generalizing to d dimensions, a C-tree contains d layers of B-trees followed
by a static k-d tree. Each layer-i B-tree contains $O(n^{1/d}/\log n)$
partitions and is constructed using the coordinates in dimension $X_i$
($X_i$-order to be precise). The static k-d trees at the leaves of these d
layers are constructed for $O(\log^d n)$ points each. Note that the above
organization yields a tree height of
$d \cdot \lceil \log(n^{1/d}/\log n) \rceil$ in the d-layered tree and
$\lceil \log(\log^d n) \rceil$ for the static k-d tree. Due to the ceiling
factors in the equation, the resulting height is
$O(\lceil \log n \rceil + d)$ in the worst case.

Exact-Match Queries. Next, let us analyze the complexity of exact-match
queries in a C-tree. This is useful for describing and analyzing insertions and
deletions later. Given a specific location p, an exact-match query retrieves all
data points in the C-tree located at p. This can be accomplished by first
identifying a sub-partition in which p may be stored and then searching in that
sub-partition, which is organized as a static k-d tree. Since each sub-partition
has $O(n/(km)) = O(\log^2 n)$ points, the latter part of the search can be
accomplished using the corresponding algorithm for a static k-d tree in
$O(\log \log^2 n)$ time. Next, we describe the time for locating a sub-partition
in which p could be stored.

1. Start with the root v of the layer-2 B-tree.

2. Search in the tree rooted at v using p as follows:

(a) Let v have m entries $([e_1, c_1], \ldots, [e_m, c_m])$ ordered using
Y-order. Then, choose the entry $[e_i, c_i]$ such that
$e_i \prec_Y p \prec_Y e_{i+1}$ (i.e., $e_i$ precedes p and p precedes
$e_{i+1}$ in Y-order). If no such entry exists, then choose the first entry.

(b) Repeat the above step for the corresponding child c till we reach a
partition P of the B-tree.

3. The partition P is organized as a layer-1 B-tree. Repeat Step 2 using this
tree to identify the sub-partition in which the insertion/deletion has to occur.
Since the layer-1 B-tree uses X-order, the search criterion in Step 2(a) uses
X-order.
Unlike range queries, the above algorithm for sub-partition identification uses
the extents in the B-tree. Since the disjointness of the partitions is reflected
in their extents, only one entry is chosen in Step 2(a). These extents are
stored in sorted order and each B-tree has at most $O(\sqrt{n}/\log n)$
partitions. Therefore, Step 2 costs O(log n) time. The same process is also
repeated in Step 3. The total time complexity of this algorithm is O(log n).
Hence, sub-partition identification and subsequently exact-match queries are
answered in the C-tree in logarithmic time.

Insertions and Deletions Let us now analyze the impact of organizing each 
sub-partition as a static k-d tree on the complexity of insertions and deletions. 
Insertions and deletions are performed as follows: 

1. Identify the sub-partition in which the insertion/deletion has to occur. 

2. Perform the insertion/deletion in the static k-d tree corresponding to the 
identified sub-partition. 

3. Re-balance the trees, as necessary. 

From the analysis of exact-match queries above, we know that the first step
takes O(log n) time. Next, we analyze the complexity of Step 2. As mentioned
earlier, the static k-d tree can support insertions and deletions without
degrading query performance only if it is reconstructed after each
insertion/deletion. Consequently, insertion in each sub-partition takes
O(c log c) time, where c is the number of points in the static k-d tree. Since
this number is $O(n/(km)) = O(\log^2 n)$, each insertion/deletion takes at least
$O((\log^2 n) \log(\log^2 n))$ time in Step 2. In Step 3, re-balancing of the
trees can be accomplished using the technique of partial rebuilding ITK^ .
Here, each B-tree of n partitions is reconstructed after $\alpha n$ more
partitions are created or deleted, where $\alpha$ is a small fraction. Since
there are at most O(log n) levels in a tree of n points, the time for such
reconstruction is O(n log n). The number of insertions and deletions that
trigger such reconstruction is proportional to the number, n, of data points in
the tree. Consequently, the reconstruction time for each layer B-tree is
amortized over the number of updates that occur in it. This adds a logarithmic
amortized cost to each update due to reconstruction of each layer B-tree. Since
the number of such layers is 2 * d = 4 for d = 2 (i.e., for 2 dimensions), the
total accrued cost for an update is O(d * log n) = O(log n).

Combining the above three costs, a C-tree for n points supports range queries
in optimal time and insertions/deletions in
$O(\log n + (\log^2 n) \log(\log^2 n))$ time.

Theorem 2. There exists a non-replicating data structure that supports range
queries on n d-dimensional points in $O(n^{(d-1)/d})$ overheads. Insertions and
deletions are supported in $O(\log^d n \log\log^d n)$ amortized time.



4.3 O-tree: An Optimal Structure for Queries, Insertions, and 
Deletions 

Consider the C-tree for n points. Range queries are answered in optimal time
in this structure. The insertion/deletion time, however, is non-optimal and has
two principal components: O(log n) complexity in the 2-layered B-tree and
$O((\log^2 n) \log(\log^2 n))$ complexity in the static k-d trees. If we used
only a static k-d tree, it would have an optimal query complexity and
O(n log n) insertion/deletion time. Note that by going from a static k-d tree
to a C-tree, we have successfully reduced the insertion/deletion time from
O(n log n) to $O((\log^2 n) \log(\log^2 n))$ without affecting the query time.
Hence, a C-tree can modularly replace a static k-d tree to reduce the
insertion/deletion time. Based on this observation, we organize the leaf
partitions of a C-tree not as static k-d trees but as C-trees. This reduces the
insertion/deletion time from $O(\log n + (\log^2 n) \log(\log^2 n))$ to
$O(\log n + (\log^2 \log^2 n) \log(\log^2 \log^2 n))$. This yields a total time
of O(log n) for insertions and deletions. Since this organization has optimal
query and insertion/deletion times, we call it the O-tree. To summarize, the
O-tree has the following structure.

O-tree(2,n) = 2-layered B-tree of $O(\sqrt{n}/\log n)$ partitions
            + C-tree of $O(\log^2 n)$ points.

Since a C-tree is a 2-layered B-tree followed by static k-d trees, we have

O-tree(2,n) = 2-layered B-tree of $O(\sqrt{n}/\log n)$ partitions
            + 2-layered B-tree of $O(\log n/\log\log n)$ partitions
            + 2-dimensional static k-d tree of $O(\log^2 \log^2 n)$ points.
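The sizes at the successive levels of this decomposition can be tabulated numerically. The sketch below is a rough illustration only: it ignores the constants hidden by the O-notation, assumes base-2 logarithms, and uses hypothetical names.

```python
import math

# Illustrative sketch; constants are ignored and log base 2 is assumed.

def o_tree_levels(n):
    """Nominal sizes of the three levels of a 2-dimensional O-tree."""
    lg = math.log2
    leaf1 = lg(n) ** 2        # points per first-level leaf partition
    leaf2 = lg(leaf1) ** 2    # points per static k-d tree at the third level
    return {
        "level1_partitions_per_btree": math.sqrt(n) / lg(n),
        "level1_leaf_points": leaf1,
        "level2_partitions_per_btree": math.sqrt(leaf1) / lg(leaf1),
        "level2_leaf_points": leaf2,
    }

LEVELS = o_tree_levels(2 ** 20)
```

The second-level partition count computed this way equals log n / (2 log log n), which is the O(log n / log log n) figure in the display above. The shrinking leaf size is what brings the cost of rebuilding a static k-d tree below the O(log n) cost of the B-tree layers.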

This organization for the O-tree is shown in Figure 3. At the first level, we
have a 2-layered B-tree of $O(\sqrt{n}/\log n)$ partitions. The leaf partitions
of this 2-layered B-tree have $O(\log^2 n)$ points and are organized as
second-level structures. Each second-level structure is a 2-layered B-tree of
$O(\log n/\log\log n)$ partitions. The leaf partitions of these second-level
2-layered B-trees are organized as static k-d trees of $O(\log^2 \log^2 n)$
points at the third level.




Fig. 3. O-tree for n 2-dimensional points.



The above arguments show that this organization achieves optimal range
query time and logarithmic update time. The update times are amortized bounds.
Note that in the above description, we assumed a 2-3 tree as the basic building
structure. By appropriately modifying it with an arbitrary b-ary tree, we obtain
the following result.

Theorem 3. There exists a non-replicating data structure that supports range
queries on n 2-dimensional points in $O(\sqrt{n/b} + t/b)$ overheads, where t
denotes the size of the query result. Insertions and deletions are supported in
$O(\log_b n)$ amortized time.

The 2-dimensional O-tree can be generalized to d dimensions by making the
following changes:

1. Replace the first-level 2-layered B-tree by a d-layered B-tree of
$O(n^{1/d}/\log n)$ partitions.

2. Replace the second-level 2-layered B-trees by d-layered B-trees of
$O(\log n/\log\log n)$ partitions.

3. At the third level, replace each 2-dimensional static k-d tree by a
d-dimensional static k-d tree of $O(\log^d \log^d n)$ points.

In summary, an O-tree for n d-dimensional points is organized as:

O-tree(d,n) = d-layered B-tree of $O(n^{1/d}/\log n)$ partitions
            + d-layered B-tree of $O(\log n/\log\log n)$ partitions
            + d-dimensional static k-d tree of $O(\log^d \log^d n)$ points.

Note that each C-tree of m points is of height
$O(d \cdot \lceil \log_b m \rceil)$, and the O-tree of n points is of height
$O(2d \cdot \lceil \log_b n \rceil) = O(\log_b n)$ in the worst case. This
leads to $O(d \cdot \log_b n)$ update and exact-match query time complexities.
The analysis of the query time complexity directly extends from the
2-dimensional case. This leads us to the following result.

Let us examine the storage cost next. Data in the static k-d trees of an
O-tree are stored in blocks of size b. Therefore, the storage cost in the static
k-d trees is O(n/b). The storage costs for B-trees at each of the layers above
can be calculated as follows. Each B-tree of k leaf partitions has k/(b - 1)
internal nodes; i.e., it uses O(k/(b - 1)) storage space. The number of layers
in the O-tree is bounded by 2d. Using these observations, it follows that the
total storage cost for an O-tree is O(n/b).

Theorem 4. Range queries in an O-tree of n d-dimensional points can be
answered in $O((n/b)^{(d-1)/d} + t/b)$ time, where t is the number of retrieved
data points. Exact-match queries can be answered in $O(\log_b n)$ worst-case
time. Insertions and deletions are supported in $O(\log_b n)$ amortized time.
Storage space used is O(n/b).

5 Conclusions 

In this paper, we examined the complexity of range searching in non-replicating
tree structures. We first proved that the time complexity of querying over n
d-dimensional data using these structures is bounded below by
$\Omega((n/b)^{(d-1)/d} + t/b)$, where t is the number of retrieved items, and b
is the capacity of index nodes. We then proposed a new structure, called the
O-tree, that achieves this lower bound in dynamic environments.

In the future, we plan to investigate the experimental and average-case
behavior of O-trees. A straightforward implementation of an O-tree would be to
map each node of the internal trees of an O-tree to a disk page. However, this
may lead to poor storage utilization since the trees could be small and may have
a low branching factor. In such cases, one could employ packing algorithms to
combine under-filled root nodes of adjacent trees, or to store complete subtrees
in few disk pages as in IDR.SSDbl without affecting the logical organization of
the O-tree. Additionally, maintaining minimum bounding rectangles in index nodes
may help in pruning irrelevant search paths.
Appendix A: Lower Bound Proofs

Let T denote a non-replicating index structure. Let $D(v)$ denote the set of
data points in the subtree rooted at a vertex v in T. Let the cardinality of
this set be denoted by $n_D(v)$. Define a point query at
$l = (l_1, \ldots, l_d)$ to be a restricted range query
$\Psi_d(l) = ([l_1, l_1], \ldots, [l_d, l_d])$. The subscript d indicates the
dimensionality of the query. All the symbols used in the lower bound proofs are
enumerated in Table 2 for reference. Next, we prove several lemmata leading to
Lemma 8.

Lemma 2. For a vertex v, the point query $\Psi_d(p)$ satisfies cond(v) if
$p \in D(v)$.

Proof: Consider a point $p \in D(v)$. Since the index is non-replicating, point
p is stored in only one vertex, say vertex u. In addition, $p \in D(v)$ implies
vertex u is either the same as vertex v, or is a descendant of v. In either
case, since the index is a tree, the path from the root to vertex u goes through
vertex v. Next, by definition of a range query, $\Psi_d(p)$ retrieves point p.
This implies $\Psi_d(p)$ satisfies the conditions associated with vertices on
the path from the root to vertex u. Since v is on this path, $\Psi_d(p)$
satisfies cond(v). □
Symbol        Explanation
------        -----------
$C_i$         Conjunction of comparisons involving query endpoints in the
              $X_i$-dimension.
cond(v)       Edge condition associated with vertex v.
$D(v)$        Set of data points in subtree rooted at vertex v.
$n_D(v)$      Cardinality of the set $D(v)$.
$S_d$         Set of n data points in a d-dimensional grid.
$P_i(S)$      Set of $X_i$-coordinates of the points in S.
gap           Gap exists at x in $P_i(S)$ if
              $\exists a, b \in P_i(S) : a < x < b$ and x is even.
$\Psi_d(l)$   Point query at location l.
$Q_i$         Set of queries whose intervals in dimension i correspond to
              gaps. Intervals in other dimensions span the entire data space.

Table 2. Symbols used in the lower bound proof.



Lemma 3. For any vertex v in a non-replicating index, if
$p = (p_1, p_2, \ldots, p_d) \in D(v)$, then the interval $[s_i, e_i]$ along
dimension $X_i$, where $s_i \le p_i \le e_i$, satisfies the conjunct $C_i$ of
cond(v).

Proof: By contradiction. Consider a query
$q = ([a_1, b_1], \ldots, [a_d, b_d])$, where $a_i = s_i$, $b_i = e_i$ and
$a_j = b_j = p_j$ for all $j \ne i$. Then, by assumption, q does not satisfy the
conjunct $C_i$ of cond(v) and therefore cond(v). Consequently, the point p,
which falls in the query range of q, is not retrieved by q. This contradicts the
range query semantics. □

The following property holds for any subset U of $S_d$.

Lemma 4. For any set $U \subseteq S_d$,

$$\sum_{1 \le i \le d} |P_i(U)| \ge d\,|U|^{1/d}.$$

Proof: By induction on d.

Basis: Let d = 1. Since $P_1(U) = U$, the lemma trivially holds.

Hypothesis: Assume the lemma holds for d = l.

Let $U \subseteq S_{l+1}$. Consider the l-dimensional set U' obtained by
deleting the $X_{l+1}$-coordinate for each point $p \in U$, i.e.,

$$U' = \{(p_1, \ldots, p_l) \mid (p_1, \ldots, p_l, p_{l+1}) \in U\}.$$

Clearly, $U' \subseteq S_l$. Hence, by the induction hypothesis,

$$\sum_{1 \le i \le l} |P_i(U')| \ge l\,|U'|^{1/l}. \qquad (1)$$

The number of points in U is at most the product of the number of points in U'
and the number of distinct $X_{l+1}$-coordinates (given by $|P_{l+1}(U)|$).
Hence,

$$|P_{l+1}(U)| \cdot |U'| \ge |U|. \qquad (2)$$

Since U' and U have the same projection sets in the dimensions $X_1$ to $X_l$
(i.e., $\forall i : (1 \le i \le l) : |P_i(U')| = |P_i(U)|$), the required
summation can be computed as

$$\sum_{1 \le i \le l+1} |P_i(U)| = \sum_{1 \le i \le l} |P_i(U')| + |P_{l+1}(U)|. \qquad (3)$$

Using equations (1), (2) and (3), we obtain

$$\sum_{1 \le i \le l+1} |P_i(U)| \ge l\,|U'|^{1/l} + |U|/|U'|.$$

The above sum is minimum when $|U'| = |U|^{l/(l+1)}$, the minimum value being
$(l + 1)\,|U|^{1/(l+1)}$. Hence, the lemma holds. □
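The inequality of Lemma 4, read here as $\sum_i |P_i(U)| \ge d\,|U|^{1/d}$ (the form in which it is applied in the proof of Lemma 6), can be spot-checked by brute force on small grids. The following sketch is only a sanity check under that reading, not a proof; the function names and the choice of a grid with odd coordinates are assumptions for illustration.

```python
import itertools
import random

# Illustrative sketch; checks sum_i |P_i(U)| >= d * |U|^(1/d) on examples.

def projection_sizes(points, d):
    """|P_i(U)| for each dimension i: number of distinct i-th coordinates."""
    return [len({p[i] for p in points}) for i in range(d)]

def lemma4_holds(points, d):
    """True if the projection-size sum meets the bound (small float slack)."""
    bound = d * len(points) ** (1.0 / d)
    return sum(projection_sizes(points, d)) >= bound - 1e-9

def random_subsets(grid, trials, seed=0):
    """Yield random non-empty subsets of the grid (seeded, reproducible)."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(1, len(grid))
        yield rng.sample(grid, size)

# A 3-dimensional grid with odd coordinates 1, 3, 5 in each dimension.
GRID_3D = list(itertools.product([1, 3, 5], repeat=3))
```

The full grid attains the bound with equality (three distinct values per dimension, 27 points), which illustrates why the minimum in the proof is reached when the point set is "cube-shaped".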

Lemma 5. Let C be a conjunction of binary comparisons, each involving an
endpoint of an arbitrary 1-dimensional interval x and a constant. Let G be a
subset of $S_1$ such that for each $p \in G$, $\Psi_1(p)$ satisfies C. Then,
there are at least (|G| - 1) point queries corresponding to gaps in G that also
satisfy C.

Proof: If G has fewer than 2 elements, then the lemma is trivially true. Assume
G has at least 2 elements. Let $C = C^l \wedge C^r$, where $C^l$ involves only
the left endpoint $x^l$ and $C^r$ involves only the right endpoint $x^r$.

Consider the conjunction of comparisons $C^l$. Each comparison is of the form
($x^l$ op c), where op is one of the three relational operators $\le$, $\ge$,
or =. The < and > operators can be rewritten using $\le$ and $\ge$ respectively
by modifying the constants appropriately. Since G has at least two elements, an
equality comparison is not permissible in $C^l$. Let
$C^l = C^l_\ge \wedge C^l_\le$, where the comparisons in $C^l_\ge$ involve only
the greater than ($\ge$) operator and those in $C^l_\le$ involve only the less
than ($\le$) operator. The conjunction $C^l_\ge$ can be reduced to one
comparison between $x^l$ and the maximum value of all the constants in its
comparisons. $C^l_\le$ can be similarly reduced by taking the minimum value of
its constants. Let $C^l = (x^l \ge c^l_1) \wedge (x^l \le c^l_2)$. (If $C^l$
does not have a comparison of the form $(x^l \ge c^l_1)$, then $c^l_1$ can be
set to $-\infty$; similarly, if the comparison $(x^l \le c^l_2)$ is missing,
then $c^l_2$ can be set to $\infty$.)

Now consider the set of points in G. Sort them in ascending order. Consider
any pair of consecutive points a and b in this sorted sequence. Since G is a
subset of $S_1$, a and b are odd values. Therefore, there exists a gap g such
that a < g < b. Next, note that the point queries for a and b satisfy the
conjunct C and therefore, the conjuncts $C^l$ and $C^r$. Consider the conjunct
$C^l$. Since $\Psi_1(a)$ satisfies $C^l$, we have
$(a \ge c^l_1) \wedge (a \le c^l_2)$, i.e., $c^l_1 \le a \le c^l_2$. Likewise,
for b we have $c^l_1 \le b \le c^l_2$. Since a < g < b, it follows that
$c^l_1 \le g \le c^l_2$, i.e., $\Psi_1(g)$ satisfies $C^l$. A similar argument
establishes that $\Psi_1(g)$ also satisfies $C^r$. Therefore, the query
$\Psi_1(g)$ satisfies the conjunct C. Next, let us count the number of gaps that
satisfy C. The number of consecutive pairs in the sorted sequence of G is
(|G| - 1). Since each such pair is separated by at least one gap, we have at
least (|G| - 1) point queries corresponding to gaps that satisfy the conjunct
C. □
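The counting argument can be illustrated in one dimension. In the sketch below the conjunction is modelled, as in the proof, by a lower and an upper constant for each endpoint; a point query has both endpoints equal to the queried value. The function names and this representation are hypothetical.

```python
# Illustrative sketch; the conjunction is represented by one (low, high)
# pair of constants per endpoint, which is an assumption for illustration.

def satisfies(point, bounds):
    """Does the point query [point, point] satisfy the conjunction?"""
    (lo_left, hi_left), (lo_right, hi_right) = bounds
    return lo_left <= point <= hi_left and lo_right <= point <= hi_right

def satisfying_gaps(points, bounds):
    """Even values strictly between the smallest and largest data point
    whose point queries satisfy the conjunction.

    Data points are odd, and all of them must satisfy the conjunction.
    """
    assert all(p % 2 == 1 and satisfies(p, bounds) for p in points)
    if len(points) < 2:
        return []
    lo, hi = min(points), max(points)
    return [g for g in range(lo + 1, hi, 2) if satisfies(g, bounds)]

G = {1, 5, 7}
BOUNDS = ((0, 9), (1, 8))
```

For G = {1, 5, 7} the satisfying gaps are 2, 4 and 6, which is at least |G| - 1 = 2, as the lemma predicts.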
Next, we estimate the number of holes that satisfy cond(v) for any vertex
$v \in T$. We restrict ourselves to vertices that have at least one data point
in their subtrees (i.e., $D(v) \ne \emptyset$).

Lemma 6. For any vertex $v \in T$, the number of hole queries that satisfy
cond(v) is at least $d\,(n_D(v)^{1/d} - 1)$.

Proof: First, we estimate the number of i-hole queries that satisfy cond(v).
Each i-hole query corresponds to a gap in the $X_i$ dimension and has an
interval [1, 2k + 1] in all other dimensions. Now, consider cond(v). Since it is
expressed as cond(v) = $C_1 \wedge C_2 \wedge \ldots \wedge C_d$, the i-hole
query satisfies cond(v) if its interval in the $X_j$ dimension satisfies the
conjunct $C_j$, $\forall j$.

First, we show that, when $j \ne i$, the conjunct $C_j$ is satisfied by the
corresponding interval of every i-hole query. Consider any i-hole query. It has
a query interval of [1, 2k + 1] in dimension $X_j$. Since
$D(v) \ne \emptyset$, there exists a point $p \in D(v)$ and since
$D(v) \subseteq S_d$, the $X_j$-coordinate of the point p, $p_j$, satisfies
$1 \le p_j \le 2k + 1$. Then, from Lemma 3 it follows that the conjunct $C_j$ is
satisfied by an interval [1, 2k + 1], i.e., by the interval of any i-hole query.

Given the above argument, the total number of i-holes that satisfy cond(v)
is given by the number of them satisfying the conjunct $C_i$ of cond(v). To
estimate this, we first note that an i-hole corresponds to a gap in
$P_i(D(v))$, the i-projection set of $D(v)$. Hence, the number of i-holes
satisfying cond(v) is given by the number of gaps of $P_i(D(v))$ satisfying
$C_i$. From Lemma 5, this number is at least $(|P_i(D(v))| - 1)$. This gives a
lower bound on the number of i-holes that satisfy cond(v). By summing the number
of i-holes that satisfy the condition cond(v) for all i, the total number of
holes that satisfy cond(v) is at least
$\sum_{1 \le i \le d} (|P_i(D(v))| - 1)$. From Lemma 4, this number is
$\ge d\,(|D(v)|^{1/d} - 1)$. □

Lemma 7. A vertex v in a non-replicating index for $S_d$ is accessed by at
least $d\,((n_D(v))^{1/d} - 1)$ hole queries.

Proof: Let vertex u be a parent of vertex v. Then, $D(v) \subseteq D(u)$. Since
each of these sets is a subset of $S_d$ that has all data coordinates at odd
values (and none at even values), every gap (which is at an even value by
definition) in $P_i(D(v))$ is also a gap in $P_i(D(u))$ for all dimensions i.
Therefore, it follows that every hole query that satisfies the edge condition of
vertex v (cond(v)) also satisfies the edge condition of u, vertex v's parent.
Extending this argument, it follows that every hole query that satisfies cond(v)
also satisfies the edge conditions of all the vertices from the root to vertex
v. The semantics of query processing ensures that these hole queries access
vertex v. Since the number of these hole queries is
$d\,((n_D(v))^{1/d} - 1)$ (from Lemma 6), the result follows. □

Next, we estimate the total complexity of all the hole queries in any
non-replicating index for the set of points $S_d$.

Lemma 8. Let T be a non-replicating index for the set of points $S_d$. Then,
the total access cost for all the hole queries in T is
$\Omega((n - o(n))/b^{(d-1)/d})$.
Proof: We prove this by induction on the height h of the tree T.

Basis: h = 0. Then, the tree consists of one single node, the root, with all the
data in it. Since the capacity of a node is at most b, the number of points n is
at most b. From Lemma 7, there are at least $d\,(n^{1/d} - 1)$ hole queries that
access this node, the access cost being 1. Since n is O(b), the cost for all
these queries is:

$$d\,n^{1/d} - d = d\,(n^{1/d} - 1) = d\,(n - o(n))/n^{(d-1)/d} = \Omega((n - o(n))/b^{(d-1)/d}).$$

Hypothesis: Assume that the lemma is true for all trees of height $\le l$.

Consider a tree T of height l + 1. Then, the total access cost for all the hole
queries is given by the cost at the root followed by the cost at the subtrees
corresponding to its children. Each hole query that accesses the root incurs a
unit cost for the access. The number of such hole queries is obtained from
Lemma 7. Hence, the total access cost for all hole queries on tree T rooted at R
(denoted by c(R)) is:

$$c(R) = d\,(n_D(R)^{1/d} - 1) + \sum_{v \in children(R)} c(v).$$

Since $n_D(R) = n$, we have

$$c(R) = d\,(n^{1/d} - 1) + \sum_{v \in children(R)} c(v).$$

Note that each subtree rooted at any child of R is of height at most l. Then,
by the induction hypothesis, we have for each child v of R,
$c(v) \ge \Omega((n_D(v) - o(n_D(v)))/b^{(d-1)/d})$. Since
$\sum_{v \in children(R)} n_D(v) = n - |data(R)|$, we have for some constant m

$$c(R) \ge d\,(n^{1/d} - 1) + m\,\Big[(n - |data(R)|) - \sum_{v \in children(R)} o(n_D(v))\Big] / b^{(d-1)/d}.$$

Since the size of the data set data(R) stored at the root is at most b and each
node in the tree has at most a constant number, r, of children, it follows that

$$c(R) \ge d\,(n^{1/d} - 1) + m\,(n - b - r \cdot o(n))/b^{(d-1)/d} = \Omega((n - o(n))/b^{(d-1)/d}).$$

Hence, the total access cost is $\Omega((n - o(n))/b^{(d-1)/d})$. □
1 Introduction



In recent years there has been an increased interest in managing data that does 
not conform to traditional data models, like the relational or object oriented 
model. The reasons for this non-conformance are diverse. On the one hand, data 
may not conform to such models at the physical level: it may be stored in data 
exchange formats, fetched from the Web, or stored as structured files. On the
other hand, it may not conform at the logical level: data may have missing
attributes, some attributes may be of different types in different data items,
there may be heterogeneous collections, or the schema may be too complex or
change too often. The term semistructured data has been used to refer to such
data. The semistructured data model consists of an edge-labeled graph, in which
nodes correspond to objects and edges to attributes or values. Figure 1
illustrates a semistructured database providing information about a city.

Relational databases are traditionally queried with associative queries, re- 
trieving tuples based on the value of some attributes. To answer such queries ef- 
ficiently, database management systems support indexes for translating attribute 
values into tuple ids (e.g. B-trees or hash tables). In object-oriented databases, 
path queries replace the simpler associative queries. Several data structures have 
been proposed for answering path queries efficiently: e.g., access support rela- 
tions HH and path indexes g]. 

In the case of semistructured data, queries are even more complex, because 
they may contain regular path expressions f1 17I8I1 fi] . The additional flexibility 
is needed in order to traverse data whose structure is irregular, or partially 
unknown to the user. For example the following query retrieves all restaurants 
serving lasagna for dinner: 



select x from (*.Restaurant) x (Menu.*.Dinner.*.Lasagna) y 



Starting at the root of the database DB, the query searches for paths satisfying 
the regular expression *.Restaurant and, from the retrieved nodes x, searches for 
another regular expression, Menu.*.Dinner.*.Lasagna. 

How are such queries evaluated? A naive evaluation that scans the whole 
database is obviously very expensive. As in the case of relational and oo databases, 
we would like to use some indexes to speed up the evaluation. Index structures 
developed for traditional data models are tied to a pre-defined schema: e.g. rela- 
tional databases index on a specific attribute of a specific relation, while object- 
oriented databases index on a specific path in the object-oriented schema |4l 1 4j . 
e.g. document.section.title. Hence, these index structures are not applicable to 
semistructured data, because here the schema is unavailable. Full text indexing 
systems take an opposite approach: given no knowledge on the structure of in- 
formation, they index all the data. But this is of limited use for semistructured 
data, where some (perhaps very partial) knowledge on the structure may be 
available and exploited in path expressions: e.g. the query above insists that a 
Dinner item appears inside a Menu. 



Recent work has addressed the problem of path expression evaluation in 
semistructured databases, but it focused mainly on deriving and 
using schema information to rewrite queries and guide the search. The issue 
of indexing was almost ignored. An exception are the dataguides of mi which 
record information on the existing paths in a database, using this as an index. 
Dataguides are restricted to a single regular expression and are not useful in 
more complex queries with several regular expressions, like the one above. 

In this paper we propose a novel, general index structure for semistructured 
databases, called template index, or T-index. It improves over the previous 
approaches in several ways. First, T-indexes allow us to trade space for generality. 
The class of paths associated with a given T-index is specified by a path template. 
For example, we can build a T-index to evaluate paths described by the 
template [P] x [P] y: here [P] can be replaced by any regular expression (P 
stands for “path expression”). The query above is of this form. Another example 
is (*.Restaurant) x [P] y, in which the first regular expression is fixed to 
*.Restaurant: this T-index takes less space but is less general. Second, T-indexes 
can be efficiently constructed. Dataguides mi require a powerset construct over 
the underlying database, which in the worst case can be of exponential cost: by 
contrast, T-indexes rely on the computation of a simulation or a bisimulation 
relation, for which efficient algorithms exist. Third, we offer guarantees for the 
size of a T-index. For example, the size of a T-index associated to a single regular 
expression is at most linear in that of the database (again, we contrast this to 
dataguides which, in the worst case, are exponential), and often, as our 
experiments show, it is much less. Fourth, we show that T-indexes turn out to be an 
elegant generalization of index structures considered previously in various 
contexts: dataguides for semistructured data, Pat trees for full text indexes, 
and Access Support Relations for OODBs. 



Our technique consists in grouping database objects into equivalence classes 
containing objects that are indistinguishable w.r.t. a class of paths defined 
by a path template as described above. Computing this equivalence relation 
may be expensive (PSPACE complete), so we consider finer equivalence classes 
defined by bisimulation or simulation, which are efficiently computable |0|. A T- 
index is built from these equivalence classes, by constructing a non-deterministic 
automaton whose states represent the equivalence classes and whose transitions 
correspond to edges between objects in those classes. 

While each T-index is designed for a particular class of queries (given by one 
template), it can be used to answer queries of more general forms. We address the 
problem of deciding whether a given query with regular path expressions can be 
rewritten to take advantage of a given T-index. In this formulation the problem 
generalizes the query rewriting problem to the case of queries with regular 
path expressions which, to the best of our knowledge, is still open. Here we have 
a more modest goal: we show that a certain restriction of this query rewriting 
problem is decidable, and, moreover, it is in PTIME for a specific class of queries, 
which is of interest in practice. Even in this restricted form, our result has an 
interesting corollary: the fact that containment of regular expressions consisting 
of concatenations of constants and wildcards is decidable in PTIME. This is 
non obvious, because the associated deterministic automaton in this case is still 
exponential in the size of the regular expression. 

In Section 2 we review the data model and query language for semi-structured 
data and introduce the notion of path templates. We first consider two specific 
templates in Sections 3 and 4, whose corresponding indexes we call 1-index and 
2-index respectively: we illustrate details of our techniques on these simpler cases. 
General T-indexes are presented in Section 5. We conclude in the final section. 




Fig. 1. Example of a semistructured database with small towns in New Jersey. 



2 Review: Data Model and Query Languages 

We review here the basic framework on semistructured databases and queries. 

The data model: Semistructured data is modeled as a labeled graph, in which 
nodes correspond to the objects in the database, and edges to their attributes. 
We assume an infinite set $\mathcal{D}$ of data values and an infinite set $\mathcal{N}$ of nodes. 

Definition 1. A data graph DB = (V, E, R) is a labeled rooted graph, where 
$V \subseteq \mathcal{N}$ is a finite set of nodes; $E \subseteq V \times \mathcal{D} \times V$ is a set of labeled edges, and 
$R \subseteq V$ is a set of root nodes. W.l.o.g. we assume that all the nodes in V are 
reachable from some root in R. We will often refer to such a data graph as a 
database. 
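
As an illustration of Definition 1, the following minimal sketch stores a data graph as Python sets and checks the reachability assumption. The node ids, labels, and the helper name `reachable` are hypothetical and chosen only for this example.

```python
from collections import deque

# Hypothetical toy data graph DB = (V, E, R); ids and labels are invented.
V = {0, 1, 2, 3, 4}
E = {(0, "Restaurant", 1), (0, "Restaurant", 2), (1, "Name", 3), (2, "Name", 4)}
R = {0}

def reachable(E, R):
    """Return the set of nodes reachable from some node of R along edges of E."""
    succ = {}
    for u, _label, v in E:
        succ.setdefault(u, []).append(v)
    seen, queue = set(R), deque(R)
    while queue:
        u = queue.popleft()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Definition 1 assumes that every node is reachable from some root.
all_reachable = reachable(E, R) == V
```

Edges are kept as (source, label, target) triples, mirroring the definition of E as a subset of V x D x V.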






Tova Milo and Dan Suciu 



Path expressions: We assume a set of base predicates $p_1, p_2, \ldots$ over the domain of 
values $\mathcal{D}$, and denote with $\mathcal{F}$ the set of boolean combinations of such predicates. 
We assume that we have effective procedures for evaluating the truth values of 
sentences $f(d)$ and $\exists x. f(x)$, for $f \in \mathcal{F}$ and $d \in \mathcal{D}$. 

We define regular path expressions, or path expressions, P, over formulas in 
$\mathcal{F}$: $P ::= \emptyset \mid \varepsilon \mid f \mid (P|P) \mid (P.P) \mid P^*$. We denote with L(P) the regular 
language defined by P, and with W(P) the set of all words $w = a_1 \ldots a_n$ in $\mathcal{D}^*$, 
s.t. there exists a word $w' = f_1 \ldots f_n \in L(P)$ and $f_i(a_i)$ holds for all $i = 1 \ldots n$ 
(i.e. the set of words obtained by replacing each formula by some value that 
satisfies it). It is easy to see that the languages defined by path expressions are 
closed under intersection and that the emptiness problem for W(P) is decidable. 

Given a data graph DB and a path $p = v_0 \xrightarrow{a_1} v_1 \xrightarrow{a_2} v_2 \ldots v_{n-1} \xrightarrow{a_n} v_n$ in DB, 
we say that p matches the path expression P iff the word $a_1 \ldots a_n$ is in W(P). 

We denote with _ the predicate True, and abbreviate _* with *. Each constant 
$d \in \mathcal{D}$ also denotes a predicate d s.t. $\forall x \in \mathcal{D}$, d(x) is true iff x = d. For example 
*.Restaurant.*.Name._.Fridays is a path expression. 
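
The matching of paths against a path expression can be sketched as follows. This is only an illustrative evaluator, assuming path expressions are encoded as nested tuples; the encoding, the function `reach`, and the toy edges are assumptions of this example, not part of the paper. It returns the nodes reachable from a start set along some path matching P.

```python
def reach(P, S, E):
    """Nodes reachable from the node set S along a path whose labels match P."""
    kind = P[0]
    if kind == "empty":                     # the empty language
        return set()
    if kind == "eps":                       # the empty word
        return set(S)
    if kind == "pred":                      # a formula f over edge labels
        return {v for (u, a, v) in E if u in S and P[1](a)}
    if kind == "alt":                       # (P | Q)
        return reach(P[1], S, E) | reach(P[2], S, E)
    if kind == "cat":                       # (P . Q)
        return reach(P[2], reach(P[1], S, E), E)
    if kind == "star":                      # P*: iterate to a fixpoint
        result = set(S)
        while True:
            new = reach(P[1], result, E) - result
            if not new:
                return result
            result |= new
    raise ValueError(f"unknown path expression: {kind}")

def const(d):
    """The predicate denoted by a constant d."""
    return ("pred", lambda a: a == d)

ANY = ("pred", lambda a: True)   # the predicate _
STAR = ("star", ANY)             # the abbreviation *

# Hypothetical toy database.
E = {(0, "City", 1), (1, "Restaurant", 2), (2, "Name", 3), (0, "Restaurant", 4)}
R = {0}
restaurants = reach(("cat", STAR, const("Restaurant")), R, E)  # *.Restaurant
```

Evaluating from the root set R gives the answer of a single-variable query path P x; each step costs at most one scan of E, so the sketch favors clarity over speed.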



Queries: A query path is an expression of the form $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$ where 
the $x_i$’s are distinct variable names, and the $P_i$’s are path expressions. Given a 
graph database DB = (V, E, R), we say that the nodes $v_0, v_1, \ldots, v_n$ satisfy a 
query path $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$ if $v_0 \in R$ (is a root) and for all $v_{i-1}, v_i$, 
$i = 1 \ldots n$, there exists a path from $v_{i-1}$ to $v_i$ that matches $P_i$. A query has the form: 

select $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ from $P_1\, x_1\, P_2\, x_2 \ldots P_n\, x_n$ 

where $1 \leq i_1 < i_2 < \ldots < i_k \leq n$. That is, a query consists of a query path 
and a set of head variables. The query in Section 1 has this form. The answer 
of a query is the projection on the indexes $i_1, \ldots, i_k$ of all tuples $(v_0, v_1, \ldots, v_n)$ 
that satisfy the query path. In the sequel we will assume w.l.o.g. that the head 
variables are either $x_1, \ldots, x_{n-1}$ or $x_1, \ldots, x_n$. In the latter case we will refer to 
the query by giving only the query path. 



Path Templates: A path template t has the form $T_1\, x_1\, T_2\, x_2 \ldots T_n\, x_n$ where each 
$T_i$ is either a regular path expression, or one of the following two place holders: 
[P] and [F]. A query path q is obtained from t by instantiating each of the 
[P] place holders by some regular path expression, and each of the [F] place holders 
by some formula. The query path thus obtained is called an instantiation of the 
path template t. The set of all such instantiations is denoted inst(t). 

For example, consider the path template 
(*.Restaurant) x1 [P] x2 Name x3 [F] x4. The following three query paths are 
possible instantiations: 

q1 = (*.Restaurant) x1 * x2 Name x3 Fridays x4 

q2 = (*.Restaurant) x1 * x2 Name x3 _ x4 

q3 = (*.Restaurant) x1 (ε | _) x2 Name x3 Fridays x4 
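
The instantiation step can be sketched as plain substitution. In this minimal example a template is assumed to be a list of (term, variable) pairs and the place holders are the strings "[P]" and "[F]"; this encoding and the function name `instantiate` are hypothetical. The sketch does not check that a filler for [F] is a formula rather than a general path expression.

```python
P_HOLE, F_HOLE = "[P]", "[F]"

def instantiate(template, fillers):
    """Replace each place holder, left to right, by the next filler string."""
    fillers = iter(fillers)
    parts = []
    for term, var in template:
        if term in (P_HOLE, F_HOLE):
            term = next(fillers)
        parts.append(f"{term} {var}")
    return " ".join(parts)

template = [("(*.Restaurant)", "x1"), (P_HOLE, "x2"), ("Name", "x3"), (F_HOLE, "x4")]
q1 = instantiate(template, ["*", "Fridays"])
q2 = instantiate(template, ["*", "_"])
```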
Given a path template t, our goal is to construct an index structure that 
will enable an efficient evaluation of queries in inst{t). (In fact, as we shall see 
later, it will also assist in answering other queries as well). The templates are 
used to guide the indexing mechanism to concentrate on the more interesting 
(or frequently queried) parts of the data graph. For example, if we know that 
the database contains a restaurants directory and that most common queries 
refer to the restaurant and its name, we may use a path template such as the 
one above. As another example, assume we know nothing about the database, 
but users never ask for more than k objects on a path. Then we may take 
t = [P] x1 [P] x2 ... [P] xk, and build the corresponding index. 



In the next two sections we focus on two particular templates, t1 = [P] x and 
t2 = * x1 [P] x2, and illustrate most technical details on these simpler indexes, 
called 1- and 2-indexes. We present the general case in Sec. 5. 



3 1-Indexes 



Our goal here is to compute efficiently queries $q \in$ inst([P] x). 



A First attempt: A naive way (which we will soon refine) is the following. For 
each node v in DB, let $L_v(DB)$, or $L_v$ in short, be the set of words on paths 
from some root node to v: 

$$L_v(DB) \stackrel{def}{=} \{w \mid w = a_1 \ldots a_n,\ \exists \text{ a path } v_0 \xrightarrow{a_1} \ldots \xrightarrow{a_n} v, \text{ with } v_0 \text{ a root node}\}$$

Next, define the language equivalence relation, $v \equiv u$, on nodes in DB to be: 

$$v \equiv u \iff L_v = L_u$$



We denote with [v] the equivalence class of v. Clearly, there are no more 
equivalence classes than nodes in DB. Language equivalence is important because two 
nodes v, u in DB can be distinguished^1 by a query path in inst([P] x) iff $u \not\equiv v$. 

A naive index can be constructed as follows: it consists of the collection of 
all equivalence classes $s_1, s_2, \ldots$, each accompanied by (1) an automaton/regular 
expression describing the corresponding language, and (2) the set of nodes in the 
equivalence class. We call this set the extent of $s_i$, and denote it by extent($s_i$). 
Given the naive index, a query path of the form P x can be evaluated 
by iterating over all the classes $s_i$, and for each class testing if the language of 
that class has a nonempty intersection with W(P). The answer of the query is 
the union of all extent($s_i$) for which this intersection is not empty. 

This naive approach is inefficient, for two reasons. 



— Construction Cost: the construction of the index is very expensive since 
computing the equivalence classes for a given data graph is a PSPACE-complete 
problem. 

^1 Distinguished means that one node is in the query’s answer while the other is not. 
— Index Size: the automaton/regular expressions associated with different 
equivalence classes have overlapping parts which are stored redundantly. This also 
results in inefficient query evaluation, since we have to intersect W(P) with 
each regular language. 

We next address these problems. To tackle the construction cost we consider 
refinements. An equivalence relation $\approx$ is called a refinement if: 

$$v \approx u \implies v \equiv u \qquad (1)$$

As we shall see, any refinement is fine for constructing 1-indexes, as long as 
it is efficiently computable: we illustrate below two examples. The basic idea 
to tackle the index size was introduced in earlier work and consists in a more concise 
representation of the languages of $s_1, s_2, \ldots$, based on finite state automata. A 
novelty here over that work is the use of a non-deterministic automaton to get an even 
more compact structure. 

Refinements: We discuss here two choices for refinements: bisimulation, $\approx_b$, and 
simulation, $\approx_s$. Both are discussed extensively in the literature. The 
idea that these can be used to approximate the language equivalence dates back 
to the modeling of reactive systems and process algebras. For completeness, 
we review their definitions here. Unlike standard definitions in the literature we 
need to traverse edges “backwards”, because $L_v$ refers to the set of paths leading 
into v. 

Definition 2. Let DB be a data graph. A binary relation ~ on its nodes is a 
backwards bisimulation if: 

1. If v ~ v' and v is a root, then so is v'. 

2. Conversely, if v ~ v' and v' is a root, then so is v. 

3. If v ~ v', then for any edge $u \xrightarrow{a} v$ there exists an edge $u' \xrightarrow{a} v'$, s.t. u ~ u'. 

4. Conversely, if v ~ v', then for any edge $u' \xrightarrow{a} v'$ there exists an edge $u \xrightarrow{a} v$, 
s.t. u ~ u'. 

A binary relation $\preceq$ is a backwards simulation, if it satisfies conditions 1 and 3. 

Since we consider only backwards simulations and bisimulations in this paper 
we safely refer to them as simulation and bisimulation. 

Two nodes v, u are bisimilar, in notation $v \approx_b u$, iff there exists a bisimulation 
~ s.t. v ~ u. Paige and Tarjan [20] describe an O(m log n) time algorithm for 
computing $\approx_b$ on an unlabeled graph with n nodes and m edges, which can be 
easily adapted to an O(m log m) algorithm for labeled graphs. 
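
For intuition, the coarsest backwards bisimulation can also be computed by naive partition refinement. The sketch below is not the Paige and Tarjan algorithm and does not achieve its O(m log m) bound; it is a simple fixpoint iteration, and the function name and toy graph are assumptions of this example.

```python
def backward_bisimulation(V, E, R):
    """Map each node to a block id of the coarsest backwards bisimulation."""
    preds = {v: [] for v in V}
    for u, a, v in E:
        preds[v].append((a, u))
    # Conditions 1 and 2: roots may only be related to roots.
    block = {v: (v in R) for v in V}
    while True:
        # Conditions 3 and 4: related nodes must have the same set of
        # (label, block of predecessor) pairs on their incoming edges.
        sig = {v: (block[v], frozenset((a, block[u]) for a, u in preds[v]))
               for v in V}
        ids = {}
        new_block = {v: ids.setdefault(sig[v], len(ids)) for v in V}
        if len(ids) == len(set(block.values())):
            return new_block      # no block was split: fixpoint reached
        block = new_block

# Hypothetical toy graph: nodes 1, 2 are bisimilar, and so are 3, 4.
V = {0, 1, 2, 3, 4}
E = {(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "b", 4)}
blocks = backward_bisimulation(V, E, {0})
```

Each round only splits blocks, so an unchanged block count means the partition is stable; at most n rounds are needed.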

Two nodes v, u are similar, in notation $v \approx_s u$, if there exist two simulations 
$\preceq$, $\preceq'$ s.t. $v \preceq u$ and $u \preceq' v$. Henzinger, Henzinger, and Kopke give an O(mn) 
algorithm for computing $\approx_s$ on an unlabeled graph with n nodes and m edges, 
which can be easily adapted to an $O(m^2)$ algorithm for labeled graphs. 

We have: $v \approx_b u \implies v \approx_s u \implies v \equiv u$, hence both $\approx_b$ and $\approx_s$ are refinements, 
i.e. satisfy Equation (1). The implications are strict as illustrated in 
Figure 121 where x = y = z,x^sy^sZ, and x z:. 
Fig. 2. A data graph on which the relations $\equiv$, $\approx_s$, and $\approx_b$ differ. 



In constructing our indexes we will use either a bisimulation or a simulation. 
The reader may wonder how much we lose in practice by using a refinement 
instead of $\equiv$. The answer is: not much. In fact, for tree data graphs the three 
coincide. We prove a slightly more general statement. Let us say that a database 
DB has unique incoming labels if for any node x, whenever a, b are labels of two 
distinct edges entering x, then $a \neq b$. In particular, tree databases have unique 
incoming labels. 
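
The unique-incoming-labels condition is easy to test directly. The following is a minimal sketch, assuming edges are given as (source, label, target) triples; the function name is hypothetical.

```python
def has_unique_incoming_labels(E):
    """True iff no node has two distinct incoming edges with the same label."""
    seen = set()
    for _u, a, v in set(E):       # set(): the same edge listed twice is one edge
        if (v, a) in seen:
            return False
        seen.add((v, a))
    return True

tree_edges = {(0, "a", 1), (0, "b", 2), (1, "a", 3)}
shared_edges = {(0, "a", 2), (1, "a", 2)}   # node 2 has two incoming a-edges
```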

Proposition 1. If DB is a graph database with unique incoming labels, then $\equiv$, 
$\approx_s$, and $\approx_b$ coincide. 

Proof. Recall that we only consider accessible graph databases, i.e. in which 
every node is accessible from some root. We will show that $\equiv$ is a bisimulation: 
this proves that $v \equiv u \implies v \approx_b u$, and the proposition follows. We check the 
four conditions in Definition 2. If $v \equiv u$ and v is a root, then $\varepsilon \in L_v$, hence 
$\varepsilon \in L_u$, so u is a root too. This proves items 1 and 2. Let $v \equiv u$ and let $v' \xrightarrow{a} v$ 
be some edge. Hence $L_v = L_1.a \cup L_2$, where $L_1 = L_{v'}$ while $L_2$ is a language 
which does not contain any words ending in a (because DB has unique incoming 
labels). It follows that $L_u = L_1.a \cup L_2$. Since v' is an accessible node in DB, we 
have $L_1 \neq \emptyset$, hence there exists some edge $u' \xrightarrow{a} u$ entering u, and it also follows 
that $L_{v'} = L_{u'}$. 

1-Indexes: We can now define 1-indexes. Given a database DB and a refinement 
$\approx$, the 1-index I(DB) is a rooted, labeled graph defined as follows. Its nodes are 
equivalence classes [v] of $\approx$; for each edge $v \xrightarrow{a} v'$ in DB there exists an edge 
$[v] \xrightarrow{a} [v']$ in I(DB); the roots are [r] for each root r in DB. When DB is clear 
from the context we omit it and simply write I. 

We store I as follows. First we associate an oid s to each node in I, and store 
I’s graph structure in a standard fashion. Second, we record for each node s 
the nodes in DB belonging to that equivalence class, which we denote extent(s). 
That is, if s is an oid for [v], then extent(s) = [v]. The space for I incurs two 
costs: the space for the graph I, and that for the extents. The graph is at most 
as large as the data graph DB, but we will argue that in practice it may be much 
smaller. The extents contain each node in DB exactly once. This is similar to an 
unclustered index in a relational database, where each tuple id occurs exactly 
once in the index. 
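
The construction of the index graph and its extents can be sketched in a few lines, assuming the refinement is already given as a map from each node to the oid of its class. The function name, the oids, and the toy database are hypothetical.

```python
def build_1_index(E, R, cls):
    """Build the 1-index from edges E, roots R and a node -> class-oid map."""
    index_edges = {(cls[u], a, cls[v]) for (u, a, v) in E}
    index_roots = {cls[r] for r in R}
    extent = {}
    for v, s in cls.items():
        extent.setdefault(s, set()).add(v)
    return index_edges, index_roots, extent

# Hypothetical toy database; cls groups bisimilar nodes by hand.
E = {(0, "a", 1), (0, "a", 2), (1, "b", 3), (2, "b", 4)}
R = {0}
cls = {0: "s0", 1: "s1", 2: "s1", 3: "s2", 4: "s2"}
index_edges, index_roots, extent = build_1_index(E, R, cls)
```

Because the index edges form a set, parallel data edges between the same pair of classes collapse into one index edge, which is where the space saving comes from; every node still appears in exactly one extent.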
Evaluating Query Paths with 1-Indexes: We describe now how to evaluate a 
query path P x. Rather than evaluating it on the data graph DB we evaluate 
it on the index graph I(DB). Let $\{s_1, s_2, \ldots, s_k\}$ be the set of nodes in I(DB) 
that satisfy the query path. Then the answer of the query on DB is 
extent($s_1$) $\cup$ extent($s_2$) $\cup \ldots \cup$ extent($s_k$). The correctness of this algorithm follows from the 
following proposition: 

Proposition 2. Let $\approx$ be a refinement (i.e. satisfies Equation (1)) on DB. 
Then, for any node v in DB, $L_v(DB) = L_{[v]}(I(DB))$. 

Proof. The inclusion $\subseteq$ holds for any equivalence relation $\approx$, not only 
refinements: this is because any path $v_0 \xrightarrow{a_1} v_1 \xrightarrow{a_2} v_2 \ldots$ in DB, with $v_0$ a root 
node, has a corresponding path $[v_0] \xrightarrow{a_1} [v_1] \xrightarrow{a_2} [v_2] \ldots$ in I. For the converse, we 
prove by induction on the length of a word w that, if $w \in L_{[v]}$, then $w \in L_v$. 
When $w = \varepsilon$ (the empty word), then [v] is a root of I: hence $v \approx r$ for some root 
r. This implies $L_v = L_r$, so $\varepsilon \in L_v$. When $w = w_1.a$, then we consider the last 
edge in I: $s \xrightarrow{a} [v]$, with $w_1 \in L_s$. By definition there exist nodes $v_1 \in [s]$ and 
$v' \in [v]$ and an edge $v_1 \xrightarrow{a} v'$ and, by induction, $w_1 \in L_{v_1}$. This implies $w \in L_{v'}$. 
Now we use the fact that $\approx$ is a refinement, to conclude that $w \in L_v$. 

The cost of evaluating a query P x on a graph is polynomial in the size of 
the graph and that of the query path^2. Since I(DB) is likely to be smaller than 
DB, evaluating a query on I(DB) rather than DB is faster. Note that nodes in 
the index graph may have many outgoing edges. This is because an equivalence 
class may contain many nodes, and the outgoing edges of the class node are the 
union of all their outgoing edges. To make the computation faster, these edges 
can be further indexed (e.g. by hashing or using a B-tree on the labels) so that 
the selection of edges with specific labels is faster. 



Example 1. Fig. 3(a) illustrates a data graph DB and Fig. 3(b) its 1-index I. 
Considering the query q = t.a x, its evaluation follows the two paths t.a in I 
(rather than the 5 in DB), and unions their extents: {7, 13} ∪ {8, 10, 12}. 
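
The evaluation in Example 1 can be mimicked with a small sketch that follows a sequence of labels through an index graph and unions the extents reached. The index below is invented for illustration (the actual graph of Fig. 3 is not reproduced here); only the two extents from the example are reused, and the function name is hypothetical.

```python
def eval_labels_on_index(labels, index_edges, index_roots, extent):
    """Follow the label sequence from the index roots, then union the extents."""
    frontier = set(index_roots)
    for label in labels:
        frontier = {t for (s, a, t) in index_edges
                    if s in frontier and a == label}
    answer = set()
    for s in frontier:
        answer |= extent[s]
    return answer

# Hypothetical index with two t.a paths ending in different classes.
index_edges = {("s0", "t", "s1"), ("s0", "t", "s2"),
               ("s1", "a", "s3"), ("s2", "a", "s4"), ("s2", "b", "s5")}
index_roots = {"s0"}
extent = {"s0": {1}, "s1": {2, 3}, "s2": {4},
          "s3": {7, 13}, "s4": {8, 10, 12}, "s5": {9}}
answer = eval_labels_on_index(["t", "a"], index_edges, index_roots, extent)
```

The traversal never touches the data graph: only the final union reads the extents.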



The Size of a 1-Index: The storage of a 1-index consists of the graph I and the 
sum of all extents. Query performance is dictated by the former, which we discuss 
here. On the experimental side we computed I(DB) for a variety of databases, 
obtaining results which show that in common scenarios I is significantly smaller 
than DB. A brief discussion of the experiments is given in the Appendix. On 
the theoretical side we identified two parameters which alone control the size 
of I. These are: (1) the number of distinct labels in DB, and (2) the longest 
simple path^3. We show here that there exists an upper bound for the size of I 

^2 We assume here that each predicate p(d) can be computed in O(1), for each $d \in \mathcal{D}$. 
^3 A simple path is a path which does not go twice through the same node. 
Fig. 3. A data graph (a), its 1-index (b), and its strong dataguide (c) 



which depends only on these two parameters and is otherwise independent of 
DB. Technically this is one of the hardest results in this paper, whose proof will 
be included in the full version (omitted here for lack of space), and we believe it 
is valuable in focusing future research aimed at reducing the index size. 

Formally, for a database DB and number k, we say that DB is “k-short” 
if there are no simple paths of length > k. For example trees of depth < k are 
k-short. Some important instances of semistructured databases are in practice 
k-short, for some small k. Namely many web sites have the following structure: 
they start as a tree of depth d, then add a navigation bar, consisting of p links to p 
distinguished pages in the web site: importantly, every page having a navigation 
bar refers to the same set of p distinguished pages. It is easy to see that such a 
database is (d + p(d − 1))-short. In practice, both d and p are very small, even if 
the web site itself is large. 

Theorem 1. Let DB be a k-short database having at most p distinct labels, 
and let $\approx$ be any refinement which is at least as coarse as a bisimulation^4. Then 
the size of I is bounded by some number depending only on k and p, and is 
independent of the size of DB. 

Connection to Related Work: Data Guides - Earlier work on dataguides proposed 
for the first time a method for extracting all the possible path information 
from a given database DB, and described it as a concise labeled graph called a 
dataguide. In that approach each path in the data is represented exactly once 
in the dataguide. If we view DB as an automaton by making each node a 
terminal state and each root an initial state, then a dataguide is by definition a 
deterministic automaton equivalent to DB. Among all dataguides only one has 
states which are related to sets of nodes in DB (what we call here extents): this 
is called a strong dataguide, and is precisely the standard powerset automaton 
of DB. Unlike 1-indexes, the extents in a strong dataguide may overlap. Hence, 
the storage for dataguides can be larger than for 1-indexes for two reasons: (1) 
the size of the dataguide graph may be as large as exponential in that of the 
database, while the 1-index is at most linear, and (2) the total size of all extents 
in a dataguide may be as large as exponential in that of the database, due to 

^4 That is, $u \approx_b v \implies u \approx v$. 
overlaps, while for 1-indexes it is exactly the number of nodes in DB. A contribution 
of our work is to show that, by relaxing the determinism requirement 
imposed on dataguides, the 1-indexes can be constructed and stored more effi- 
ciently, while at the same time serve similar goals. We pinpoint the relationship 
between dataguides and 1-indexes in the following proposition. (Proof omitted.) 

Proposition 3. Let $\approx$ be any refinement relation on the nodes of a database 
DB (i.e. $\approx$ satisfies Equation (1)) and let I be the 1-index constructed on DB 
using $\approx$. Then the deterministic automaton built from I by the standard powerset 
construct coincides with the strong dataguide. 

Referring to the data graph DB in Fig. 3(a) and index I in Fig. 3(b), the 
strong dataguide is shown in Fig. 3(c): it is the powerset construct of both DB 
and I. On tree databases however, 1-indexes coincide with strong dataguides 
(because here I is deterministic). 



4 2-Indexes 



In this section we describe index structures for answering queries of the form 
select x1, x2 from * x1 P x2, with P a regular path expression: the template is 
* x1 [P] x2. We are interested in pairs of nodes (x1, x2), so we define: 

$$L_{(v,u)}(DB) \stackrel{def}{=} \{w \mid w = a_1 \ldots a_n, \text{ and there exists a path } v \xrightarrow{a_1} \ldots \xrightarrow{a_n} u \text{ in } DB\}$$



We write $L_{(v,u)}$ when DB is clear from the context. Now, define two pairs to 
be equivalent, $(v,u) \equiv (v',u')$, iff $L_{(v,u)} = L_{(v',u')}$, and let [(v,u)] denote the 
equivalence class of (v,u). As before, computing $\equiv$ is expensive, so we consider 
(efficiently computable) refinements, $\approx$, satisfying: 

$$(v,u) \approx (v',u') \implies (v,u) \equiv (v',u') \qquad (2)$$



As before, there exist efficient refinements based on simulation and bisimulation. 
(Details are omitted for lack of space). We define the 2-index $I^2(DB)$ of 
DB to be the following rooted graph. Its nodes are equivalence classes, [(v,u)], 
of $\approx$; the roots are all the equivalence classes of the form [(x,x)]; finally, for each 
edge $u \xrightarrow{a} u'$ and each node v in DB there is an edge $[(v,u)] \xrightarrow{a} [(v,u')]$. We store 
$I^2$ as two logical components: the graph itself and the extent extent(s) = [(v,u)], 
for each node s representing the equivalence class [(v,u)]. 

Proposition 2 now becomes: $L_{(v,u)}(DB) = L_{[(v,u)]}(I^2(DB))$. Note that the 
$L_{(v,u)}(DB)$ on the left represents the paths between v and u in DB, while the 
$L_{[(v,u)]}(I^2(DB))$ on the right represents the paths, in the 2-index $I^2(DB)$, 
between some root of the index and [(v,u)]. 

Query evaluation with 2-indexes proceeds similarly to that with 1-indexes, 
with a small modification: to compute select x, y from * x P y, we compute the 
query path P y on $I^2$ and take the union of the extents. Note that this saves 
the * search: but we may have to start at several roots in $I^2$. Often, these are 
only a few. For example, in acyclic databases, $I^2$ has a single root, because^5 
$(u,u) \equiv (v,v)$ for all nodes u, v in DB. Figure 4 shows the 2-index (without 
extents) for the database in Figure 3(a). It has a single root: the top node. The 
query select x, y from * x a y is evaluated by traversing the outgoing a edges of 
that root. 
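
A 2-index can be sketched for small acyclic databases by using language equivalence itself as the refinement, since there each language $L_{(v,u)}$ is finite and can be enumerated. This is an illustrative assumption, not the efficient refinement the paper alludes to; the sketch also keeps only pairs (v,u) where u is reachable from v, and all names and the toy chain are hypothetical.

```python
from collections import defaultdict

def pair_languages(V, E):
    """L(v,u) as a set of label tuples, for an ACYCLIC data graph only."""
    succ = defaultdict(list)
    for u, a, w in E:
        succ[u].append((a, w))
    L = defaultdict(set)

    def walk(start, node, word):
        L[(start, node)].add(word)
        for a, nxt in succ[node]:
            walk(start, nxt, word + (a,))

    for v in V:
        walk(v, v, ())
    return L

def build_2_index(V, E):
    L = pair_languages(V, E)
    cls = {pair: frozenset(words) for pair, words in L.items()}  # class = language
    roots = {cls[(x, x)] for x in V}
    edges = {(cls[(v, u)], a, cls[(v, u2)])
             for (u, a, u2) in E for v in V if (v, u) in cls}
    extent = defaultdict(set)
    for pair, s in cls.items():
        extent[s].add(pair)
    return roots, edges, extent

# Hypothetical chain 0 -a-> 1 -b-> 2.
V = {0, 1, 2}
E = {(0, "a", 1), (1, "b", 2)}
roots, edges, extent = build_2_index(V, E)

# select x, y from * x a y: follow the a-edges leaving the root(s).
targets = {t for (s, a, t) in edges if s in roots and a == "a"}
answer = set().union(*(extent[t] for t in targets))
```

On this acyclic example the index has a single root, in line with the remark above, and the query is answered without any * search.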




Fig. 4. A 2-index for the data graph 



As for 1-indexes, the storage of a 2-index consists of two parts: the graph and 
the extents. Both are now (at worst) quadratic in the size of DB. Again, while 
this guarantees that querying the index is no more expensive than querying the 
database, we would like to keep the index as small as possible. Our experiments 
(described briefly in the Appendix) indicate that in practice the index size is 
smaller than this upper bound, thus providing a significant improvement in query 
evaluation. A number of implementation techniques for further reducing the size 
of 2-indexes are also available, but they are beyond the scope of this paper. On 
the theoretical side, Theorem 1 can be extended to 2-indexes for obtaining upper 
bounds on the size of the graph of $I^2$ which are independent of the size of DB: 
we omit this for lack of space. 

Connection to Related Work: Patricia Trees - We conclude this section by mentioning 
the relationship to full-text indexing mechanisms and in particular to Pat 
trees. Their purpose is to assist in computing regular expressions over large 
text files. A Pat tree is a Patricia tree constructed over all the possible 
suffixes of a text (viewing the text as infinitely long), as follows. The root node 
will have one outgoing edge for each character in the file. Each of its children, 
say that corresponding to the letter k, will have one child for each character fol- 
lowing that letter, e.g. the children may correspond to ka, kb, kc, . . . These nodes 
in turn will have one child for each continuation of that group of two characters, 
etc. If a node has only one child, that child is deleted, and the node is annotated 
with the number of descendents being omitted. The leaves of the tree point back 
into the data, to the beginning of the corresponding strings. 
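
The structure described above can be approximated by a plain suffix trie. The sketch below is a simplification: it works on a finite text, omits the Patricia compression of single-child chains, and stores start positions at the node where each suffix ends. The function names and the reserved key "$pos" are assumptions of this example.

```python
def suffix_trie(text):
    """Nested-dict trie over all suffixes; '$pos' lists suffix start offsets."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node.setdefault("$pos", []).append(start)
    return root

def find(trie, pattern):
    """Start offsets of all occurrences of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]
    while stack:                       # collect every suffix below this node
        n = stack.pop()
        for key, child in n.items():
            if key == "$pos":
                out.extend(child)
            else:
                stack.append(child)
    return sorted(out)

trie = suffix_trie("banana")
hits = find(trie, "an")
```

Every occurrence of a pattern is a prefix of some suffix, so a single descent from the root locates all of them; the uncompressed trie may take quadratic space, which the Patricia compression avoids.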

There exists a close relationship between Pat trees and 2-indexes. To a sequence 
of characters $a_1, a_2, \ldots$ associate a data graph DB with a single long 
chain: $v_1 \xrightarrow{a_1} v_2 \xrightarrow{a_2} \ldots$ Then the 2-index $I^2$ for DB is a tree, and the Pat tree 
can be obtained from $I^2$ by performing some simple optimizations: (1) keeping 
only the x values in the extents, (2) skipping nodes and pointing back to the 

^5 This statement applies also to $\approx_b$ and $\approx_s$. 
data whenever the descendents form a long chain, and (3) keeping extents only 
in leaf nodes. These and other optimizations will be described in the full version 
of the paper. 



5 T-Indexes 



We now turn to general T-indexes. To illustrate with an example from semistructured 
data, consider the repository of cities in Figure 1. Assume that a high 
percentage of the query mix has the form select from *.Restaurant x1 P x2, 
where P is some arbitrary path expression: that is, the query conforms to the 
template *.Restaurant x1 [P] x2. Rather than indexing all the paths, it is more 



convenient to index only those having a Restaurant incoming edge. Another ex- 
ample is the case where most of the information in the database has a fixed, 
pre-defined structure, and only certain components are irregular. For example, 
consider the relation Restaurants{Name, Phone, Menu): Name and Phone have 
a fixed structure while the Menu attribute has a complex structure that dif- 
fers from one restaurant to the other. We want to use standard optimization 
and indexing techniques for the structured parts, and focus our novel indexing 
mechanisms to the Menu part, where the standard ones do not apply. 

For the remainder of this section we fix a template $t = T_1\, x_1\, T_2\, x_2 \ldots T_n\, x_n$, 
where each of the $T_i$’s is either a path expression or a place holder [P] or [F]. 



We build an index structure, called a T-index, to assist in answering queries 
$q \in$ inst(t). Before going into the definition of the index, we would like to point 
out that T-indexes both generalize and specialize 1- and 2-indexes, in certain ways. 
The generalization comes from the fact that both 1- and 2-indexes are particular 
cases of T-indexes (see below). But T-indexes also specialize 1- and 2-indexes, 
because of the following intuition. Suppose we built a T-index for a template t, 
and then want to evaluate a query Q = select x from P x. We can always use a 
1-index to evaluate Q, but we can use the T-index only if the path expression 
P is in some sense “compatible” with the $T_1.T_2 \ldots T_n$ path in t: thus T-indexes 
reduce the class of path expressions that can be evaluated. We will discuss below 
how to test whether a given query can be evaluated using a T-index. 



Definitions: We assume that the query binds the n variables $x_1, \ldots, x_n$ in that 
order. A partial binding will correspond then to an i-tuple $(v_1, \ldots, v_i)$ of nodes 
in DB, for some $i = 1, \ldots, n$. For each such i-tuple we will consider the languages 
$L_{(v_{j-1},v_j)}$ for $j = 1, \ldots, i$ defined in Sec. 4 (by convention $L_{(v_0,v_1)} = L_{v_1}$, defined 
in Sec. 3). We fix n + 1 fresh symbols $\$, S_1, \ldots, S_n$ not occurring in $\mathcal{D}$. ($\mathcal{D}$ is the 
domain of data values from Definition 1.) 

Definition 3. Let $t = T_1\, x_1 \dots T_n\, x_n$ be a path template and $(v_1, \dots, v_i)$ an
$i$-tuple of nodes in DB, $i = 1, \dots, n$. $T^t_{(v_1, \dots, v_i)}(DB)$ is the language over the
alphabet $\mathcal{D} \cup \{\$, S_1, \dots, S_n\}$ generated by the regular expression
$R_1.\$.R_2.\$ \cdots \$.R_i$, where the $R_j$'s, $j = 1, \dots, i$, are the regular expressions below:

- If $T_j = P$ (path), then $R_j = L(v_{j-1}, v_j)$.
- If $T_j = F$ (formula), then $R_j = L(v_{j-1}, v_j) \cap \mathcal{D}$.
- If $T_j = P_j$ (constant), then $R_j = S_j$ if $L(v_{j-1}, v_j) \cap W(P_j) \neq \emptyset$, and $R_j = \emptyset$ otherwise.

For two $i$-tuples $(v_1, \dots, v_i)$ and $(u_1, \dots, u_i)$ we define the language-equivalence
relation, $(v_1, \dots, v_i) \equiv (u_1, \dots, u_i)$, iff $T^t_{(v_1, \dots, v_i)}(DB) = T^t_{(u_1, \dots, u_i)}(DB)$. The
equivalence class of $(v_1, \dots, v_i)$ is denoted $[(v_1, \dots, v_i)]$.

As before, two tuples $(v_1, \dots, v_n)$, $(u_1, \dots, u_n)$ in DB can be distinguished by
a query path $P_1\, x_1 \dots P_n\, x_n$ in $inst(t)$ iff $(v_1, \dots, v_n) \not\equiv (u_1, \dots, u_n)$. The goal
of the $\$$ and the new $S_i$ symbols is to pinpoint the range of each of the
path terms in the query (and in particular those that match the constant path
expressions in the template), and thus determine the assignments of nodes to
the query variables. This issue will be further clarified below.

Computing $\equiv$ is expensive, so we consider refinements, $\approx$, satisfying:

$$(v_1, \dots, v_i) \approx (u_1, \dots, u_i) \implies (v_1, \dots, v_i) \equiv (u_1, \dots, u_i) \qquad (3)$$

and that can be computed efficiently. As for the case of 1- and 2-indexes, it is
possible to define efficient refinements using variants of the traditional simulation
and bisimulation relations. (Details omitted.)

Given $\approx$, the T-index $I^t(DB)$ for $t$ is the following rooted, labeled graph:

Nodes - The nodes include all the equivalence classes (w.r.t. $\approx$) $[(v_1, \dots, v_i)]$,
$i = 1, \dots, n$. Also, for each such class we introduce an additional new node, which we
denote $[(v_1, \dots, v_i)]^\$$.

Edges - For each $i$-tuple there is an edge labeled $\$$ from $[(v_1 \dots v_{i-1}, v_i)]^\$$ to
$[(v_1 \dots v_{i-1}, v_i, v_i)]$, $1 \le i < n$. Additionally, each $T_i$ in the template
$t = T_1\, x_1 \dots T_n\, x_n$ introduces some edges, depending on its structure:

1. If $T_i = P$, then for each edge $v_i \xrightarrow{a} v_i'$ in DB, $I^t$ has an edge
   $[(v_1 \dots v_{i-1}, v_i)] \xrightarrow{a} [(v_1 \dots v_{i-1}, v_i')]$. Additionally, each
   $[(v_1 \dots v_i)]$ has an edge to $[(v_1 \dots v_i)]^\$$ labeled by a special $\varepsilon$ symbol.

2. If $T_i = F$, then for each edge $v_i \xrightarrow{a} v_i'$ in DB, $I^t$ has an edge
   $[(v_1 \dots v_{i-1}, v_i)] \xrightarrow{a} [(v_1 \dots v_{i-1}, v_i')]^\$$.

3. If $T_i = P_i$, then for each node $[(v_1 \dots v_{i-1}, v_i)]$ and every $v'$ s.t.
   $L(v_i, v') \cap W(P_i) \neq \emptyset$, $I^t$ contains an edge
   $[(v_1 \dots v_{i-1}, v_i)] \xrightarrow{S_i} [(v_1 \dots v_{i-1}, v')]^\$$, where $S_i$ is a new symbol.

Root nodes - The roots are all the nodes $[(v)]$ where $v$ is a root of DB.

Terminal nodes - Unlike graph databases and 1- and 2-indexes, here we distinguish
terminal nodes: these are all nodes of the form $[(v_1, \dots, v_n)]^\$$.

Finally, we remove all nodes not reachable from a root or not having an outgoing
path to a terminal node, and associate with each terminal node $[(v_1, \dots, v_n)]^\$$
the extent containing all tuples in $[(v_1, \dots, v_n)]$.



As in the case of 1- and 2-indexes, when nodes in $I^t$ have many outgoing edges, we
can further index their labels.

Example 2. Consider the template $t$ = (*.Restaurant.*.Menu) $x$ $P$ $y$. The
equivalence classes are the following. For single nodes, $u$, there are exactly two
classes $[(u)]$: the first, $s_1$, contains all nodes $u$ reachable from a root via a path
matching *.Restaurant.*.Menu, and the second, $s_2$, contains all the other nodes.
Considering pairs next, the equivalence classes are now sets of pairs $(u, v)$ for
which $u \in s_1$ and which have the same language $L(u, v)$; in addition there are
similar equivalence classes for pairs $(u, v)$ with $u \in s_2$.

$I^t$ has one edge $s_1 \xrightarrow{S_1} s_1^\$$, continued with $s_1^\$ \xrightarrow{\$} [(u, u)]$ for $u \in s_1$; it has edges
$[(u, v)] \xrightarrow{a} [(u, v')]$ for edges $v \xrightarrow{a} v'$ in DB and $u \in s_1$; and finally it has edges
$[(u, v)] \xrightarrow{\varepsilon} [(u, v)]^\$$, ending in a terminal state. Note that $s_2$ has no outgoing
edges, hence all nodes $[u]$, $[(u, v)]$ with $u \in s_2$ are removed from the graph. The
resulting T-index looks like a 2-index that considers only the data reachable by
a *.Restaurant.*.Menu path.

Observe that every path from a root to a terminal node traverses exactly
$n - 1$ $\$$-edges. We define $L_{[(v_1, \dots, v_i)]^\$}(I^t(DB))$ to be the language describing paths from
the root to $[(v_1, \dots, v_i)]^\$$, with the slight modification that the $\varepsilon$ symbols are
interpreted as epsilon moves (i.e. they are omitted from the strings).

Evaluating Query Paths with T-Indexes: In the simplest scenario the query
matches the template completely, i.e. $q = P_1\, x_1 \dots P_n\, x_n \in inst(t)$; we assume
that each wildcard _ is replaced with (not($\$$) $\wedge$ not($S_1$) $\wedge \dots \wedge$ not($S_n$)),
since our alphabet now becomes $\mathcal{D} \cup \{\$, S_1, \dots, S_n\}$. First, let
$P_q \stackrel{\text{def}}{=} P_1'.\$.P_2'.\$ \cdots \$.P_n'$, where:

$$P_i' \stackrel{\text{def}}{=} \begin{cases} S_i & \text{when } T_i \text{ is a constant} \\ P_i & \text{when } T_i \text{ is } P \text{ or } F \end{cases}$$

Then evaluate the query path $P_q\, x$ on $I^t$, interpreting the $\varepsilon$ edges as epsilon
moves. Since $P_q$ has exactly $n - 1$ $\$$-signs, all the retrieved nodes are of the
form $[(v_1, \dots, v_n)]^\$$. The answer to the query is the union of the extents of the
retrieved nodes. The following guarantees the correctness of this algorithm.
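
The rewriting of a query into $P_q$ can be sketched in Python as follows. This is only an illustration under stated assumptions: the encoding of a template as a list of tags ('P', 'F', 'const') and the function name build_index_query are hypothetical and do not come from the paper, and path expressions are treated as opaque strings.

```python
from typing import List


def build_index_query(template_kinds: List[str], query_paths: List[str]) -> str:
    """Build P_q = P'_1.$.P'_2.$ ... $.P'_n for a query instantiating a template.

    template_kinds[i] is 'P', 'F' or 'const' (a hypothetical encoding of T_i);
    query_paths[i] is the path expression P_i used by the query at position i.
    """
    if len(template_kinds) != len(query_paths):
        raise ValueError("the query needs one path expression per template position")
    parts = []
    for i, (kind, path) in enumerate(zip(template_kinds, query_paths), start=1):
        if kind == "const":
            # A constant path expression is replaced by its fresh symbol S_i.
            parts.append(f"S{i}")
        elif kind in ("P", "F"):
            # Place holders keep the expression supplied by the query.
            parts.append(f"({path})")
        else:
            raise ValueError(f"unknown template kind: {kind!r}")
    # Consecutive parts are separated by the $ symbol, giving n - 1 $-signs.
    return ".$.".join(parts)


# Template of Example 2: a constant path followed by the place holder P.
print(build_index_query(["const", "P"], ["*.Restaurant.*.Menu", "Name"]))
```

The resulting string contains one `$` fewer than the number of template positions, which mirrors the observation that every retrieved node is a terminal node.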

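Proposition 4. (1) Let $\approx$ be a refinement on DB. Then, for every $i = 1, \dots, n$
and every $i$-tuple $(v_1 \dots v_i)$, we have $T^t_{(v_1, \dots, v_i)}(DB) = L_{[(v_1, \dots, v_i)]^\$}(I^t(DB))$. (2)
A tuple $(v_1, \dots, v_n)$ satisfies a query $q$ iff $T^t_{(v_1, \dots, v_n)}(DB) \cap W(P_q) \neq \emptyset$.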

Evaluating More Complex Queries: Sometimes we can use a T-index to evaluate
queries $q \notin inst(t)$. We illustrate first with two examples.

Example 3. Let $t = P\, x\, ((B.A)^*)\, y\, C\, z$ and $q = ((A.B)^*).A\; y\; C\; z$. Obviously
$q \notin inst(t)$, but we can still use $I^t$ as follows. First instantiate $t$ to
$p = A\, x\, ((B.A)^*)\, y\, C\, z \in inst(t)$ (we have instantiated $P$ with $A$). Then $q$
can be expressed as a projection from $p$, namely as select $y, z$ from $p$, because
$A.(B.A)^* = (A.B)^*.A$.

Example 4. Let $t = A\, x\, B\, y\, C\, z$ and $q = A\, x\, B\, y\, (C.D)\, u\, E\, v$. Again $q \notin inst(t)$.
Here $t$ has a single instance, $p = A\, x\, B\, y\, C\, z$. We can use it to compute a prefix
of $q$, namely the variables $x$ and $y$, then continue to compute $u, v$ with a search in
the data graph. That is, we rewrite $q$ as: select $(x, y, u, v)$ from $p$, $y\, (C.D)\, u\, E\, v$.
A subtle point here is that the unused "side branch" of $p$, namely $C\, z$, is not
harmful (it is implied by $y\, (C.D)\, u$). In effect we have replaced some prefix of $q$
with an instance of $t$: we call this prefix replacement.
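
The identity $A.(B.A)^* = (A.B)^*.A$ that Example 3 relies on can be spot-checked mechanically. The sketch below is an illustration only: it compares two regular expressions on all words up to a bounded length using Python's `re` module, which is a sanity check rather than a proof of equivalence. The helper name same_language_up_to is an assumption made for this example.

```python
import re
from itertools import product


def same_language_up_to(regex1: str, regex2: str, alphabet: str, max_len: int) -> bool:
    """Return True if both regexes accept exactly the same words of length <= max_len."""
    p1, p2 = re.compile(regex1), re.compile(regex2)
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            # fullmatch requires the whole word to be in the language.
            if bool(p1.fullmatch(word)) != bool(p2.fullmatch(word)):
                return False
    return True


# Labels A and B are written as single characters; concatenation is implicit.
print(same_language_up_to("A(BA)*", "(AB)*A", "AB", 8))
print(same_language_up_to("A(BA)*", "(AB)*", "AB", 4))
```

The first call compares the two sides of the identity; the second compares against a different expression to show that the check can fail.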

The general problem of deciding whether a path query q can be rewritten in 
terms of one or more T-indexes generalizes the query rewriting problem [ I hj to 
regular path expressions. We do not attempt to solve the general problem (for 
regular expressions it is still open) but restrict ourselves to prefix replacement. 

Given a template $t$ and a query path $q$ with variables $x_1, \dots, x_n$, we define
a prefix replacement of $q$ w.r.t. $t$ to consist of (1) an instance $p \in inst(t)$
whose variables are renamed to include a prefix $x_1, \dots, x_i$ of $q$'s variables, in
that order, and (2) a postfix $q'$ of $q$ containing the variables $x_i, \dots, x_n$, such
that the query select $(x_1, \dots, x_n)$ from $p, q'$ is equivalent to $q$. Note that the new
query consists of a query path which includes the variables $x_1, \dots, x_n$, and a
possible side branch (everything after $x_i$ in $p$; see Example 4). Checking whether
a query path $q$ admits a prefix replacement is PSPACE-hard, by reduction from the
equivalence problem for regular expressions (which is PSPACE-complete):
$R, R'$ are equivalent iff the query path $q = R\, x$ has a prefix replacement w.r.t.
the template $t = R'\, y$. In the full version of the paper we prove that one can
check in PSPACE whether there exists a prefix replacement (and find one, when
it exists).

Finally, we consider a particular case of templates and queries which we 
believe to be more frequent in practice. Define a regular path expression to be 
simple if it consists of a concatenation of (1) constants from V, (2) _, and (3) *. 

For example *.A. * .B C is a simple regular path expression. Similarly, define 

a template to be simple if all its constant regular expressions (if it has any) are 
simple. We prove in the full version of the paper that checking/finding a prefix 
replacement for a simple query w.r.t. a simple template is in PTIME. At the core
of this result lies a lemma stating that containment of simple path expressions
can be tested in PTIME. This may come as a surprise, since the deterministic
automata associated with a simple regular path expression may have exponentially
many states (proof omitted), hence the traditional containment test for regular
languages would be much more expensive. Summarizing:

Proposition 5. Given a template $t$ and a query path $Q$, the problem whether
there exists a prefix replacement of $Q$ w.r.t. $t$ is PSPACE-complete. When both
$Q$ and $t$ are simple, the problem is in PTIME.



Connection to Related Work: T-indexes are flexible structures which can be
fine-tuned to trade off space for generality. They capture 1- and 2-indexes, by taking
the templates $P\, x$ and $*\, x\, P\, y$ respectively. They also generalize traditional
relational indexes: assuming the encoding of relational databases as in [?], an
index on attribute $A$ of a relation R1 can be captured with the template
$(R1.tup)\, x\, A\, y\, F\, z$. Finally, they generalize path indexes in OODBs. For
example, Kemper and Moerkotte describe in [?] access support relations (ASR),
an index structure for query paths in OODBs. ASRs are designed to evaluate
efficiently paths of the form $o.A_1.A_2 \dots A_n$, where $o$ is an object and $A_1, \dots, A_n$
are attribute names. They define an access support relation, ASR, to be an
$(n+1)$-ary relation $R$ such that $(u, u_1, u_2, \dots, u_n) \in R$ iff there exists a path
$u \xrightarrow{A_1} u_1 \xrightarrow{A_2} u_2 \dots \xrightarrow{A_n} u_n$ in the database. Ignoring the mismatch between the
object-oriented and the semistructured data model, there exists a close relationship
between an ASR and the T-index for the template $*\, x\, A_1\, x_1\, A_2\, x_2 \dots A_n\, x_n$.
The graph structure of the T-index would be a chain of $2n$ nodes
$[(r)] \to [(r)]^\$ \to [(u, u_1)] \to [(u, u_1)]^\$ \to [(u, u_1, u_2)] \to \cdots \to [(u, u_1, u_2, \dots, u_n)]^\$$,
where the last (terminal) node has an associated extension: this extension is precisely the ASR.
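
To make the definition of an access support relation concrete, the following Python sketch computes the tuples $(u, u_1, \dots, u_n)$ for a small edge-labeled graph. It is a naive illustration under stated assumptions: the adjacency-list encoding and the function name access_support_relation are hypothetical, and this is neither the storage scheme of Kemper and Moerkotte nor the T-index construction.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: node -> list of (edge label, target node).
Graph = Dict[str, List[Tuple[str, str]]]


def access_support_relation(db: Graph, attrs: List[str]) -> Set[Tuple[str, ...]]:
    """Return all tuples (u, u1, ..., un) with a path u -A1-> u1 -A2-> ... -An-> un."""
    nodes = set(db) | {target for edges in db.values() for _, target in edges}
    partial = {(u,) for u in nodes}
    for attr in attrs:
        # Extend every partial path by one edge carrying the next attribute name.
        partial = {
            tup + (target,)
            for tup in partial
            for label, target in db.get(tup[-1], [])
            if label == attr
        }
    return partial


db = {"o": [("A1", "x"), ("A1", "y")], "x": [("A2", "z")], "y": []}
print(access_support_relation(db, ["A1", "A2"]))
```

In this small graph only the path through `x` can be extended by an `A2` edge, so the relation contains a single tuple.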



6 Conclusions 

We presented an indexing mechanism, called T-index, aimed at assisting in the
evaluation of query paths in semi-structured data. A T-index captures the (possibly
partial) knowledge about the structure of data and the type of queries in the
query mix, as described by a path template.

Abiteboul and Vianu consider in [3] first-order equivalence classes over tuples
of values in the database. Two tuples $(x_1, \dots, x_n)$ and $(y_1, \dots, y_n)$ are
equivalent if they are indistinguishable by any FO formula. The language equivalences
on which we base our index constructs are only superficially related to the FO
equivalence classes: in our setting equivalence classes are distinguished by path
queries, while in their setting they are distinguished by first-order formulas.
Hence the language equivalences are coarser than FO equivalences, and result
in fewer equivalence classes. Buchsbaum, Kanellakis, and Vitter consider in [?]
the problem of incrementally maintaining query paths given by a fixed regular
expression under either database insertions or deletions (but not both). They
describe an efficient method for incremental updates. Since their method refers to
a fixed regular expression, it could be used in incremental updates of T-indexes,
but only when the template is restricted to constant regular expressions. We
do not address index maintenance here, but note that a possible alternative to
incremental maintenance can be based on an optimization technique, presented
in the full version of the paper, of pointing back to the data, doing so whenever
a portion of the index graph is invalidated by an update.

Acknowledgment: We thank Micky Frankel for the implementation of the 1-, 2-,
and T-indexes.

A Appendix

Experiments: Recall that index storage consists of two parts: the graph and the
extents. We argued that the graph size is critical for query performance, and
we are currently conducting a series of experiments to assess its size. Some of
the results are reported in Table 1. The Bibtex data is relatively structured,
while the Web site (of the CS department of Tel-Aviv University) is far less
structured. We also considered randomly generated graphs, and mixed graphs
composed of both structured and unstructured components. We briefly describe
these experiments here. In order to assess the schematic information we measure
only the number of non-leaf nodes in the graphs.



Data       Graph Size   1-index   2-index
Bibtex     150          40        50
Web site   1521         198       1100

Table 1. Experiments showing index size



We started by considering 1- and 2-indices. Not surprisingly, the smallest
indices were obtained for the BibTex data: although the structure of BibTex
items may vary (hence a collection of such items is naturally modeled by the
semi-structured data model), the number of possible paths between nodes is
rather limited. We considered files of increasing size and their corresponding
graph representation. Already at 150 nodes the size of the 1- and 2-indices almost
stabilized, at about 40 and 50 vertices resp., staying at about the same size
regardless of the growth of the data, and thus providing significant performance
improvement when querying large files. Observe that the independence of the
index size from the data size is also implied by the theorem given earlier, but the
experimental results show that in practice the index size is much smaller than
the theoretical upper bound induced by the proof of the theorem.

The Web example is far less structured: many pages at the site are each built
and maintained individually by distinct people without coordinating the
structure. Hence the structure is rather loose and makes the site a typical example
of semi-structured information. For a graph of about 1500 nodes, the size of
the 1-index amounts to about 13% of the original size, and that of the 2-index
to about 72%. Observe that the latter is only 0.0475% of the potential upper
bound on the size of the 2-index, which is the square of the number of nodes
in the graph! This also implies that the effort in evaluation of queries of the
form $*\, x_1\, P\, x_2$ on the original data can potentially be as much as the square of that
needed when using the 2-index. (Since on DB we need to evaluate the query
from each node, while on $I$ we just evaluate $P\, x_2$ from $I$'s root.)
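
The percentages quoted for the Web site can be reproduced from the figures in Table 1. The short Python computation below is just that arithmetic; the variable names are chosen for this illustration.

```python
# Values for the Web site row of Table 1.
graph_nodes = 1521
one_index_size = 198
two_index_size = 1100

# Index size relative to the size of the data graph.
one_index_ratio = one_index_size / graph_nodes
two_index_ratio = two_index_size / graph_nodes

# 2-index size relative to its potential upper bound, the square of the node count.
two_index_vs_bound = two_index_size / graph_nodes ** 2

print(f"1-index: {100 * one_index_ratio:.1f}% of the graph size")
print(f"2-index: {100 * two_index_ratio:.1f}% of the graph size")
print(f"2-index: {100 * two_index_vs_bound:.4f}% of the quadratic bound")
```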

The usage of T-indices for focusing on specific, more interesting, parts of
the data was tested on mixed graphs combining randomly generated subgraphs
with BibTex or Web site-like data, and using templates focusing on the
BibTex/Web parts. The reduction in size was similar to the one reported above, or
greater, depending on the size of the randomly generated parts being ignored in the
construction due to the given template.

Abstract. With the emergence of the Web as a universal data repository,
research has recently focused on data integration and data translation,
and a common data model of semistructured data has been established.
It is being realized, however, that having a common schema model
is also necessary, to support tasks such as query formulation, decomposition
and optimization, or declarative specification of data translation.
In this paper we elaborate on the theoretical foundations of a middleware
schema model. We present expressive and flexible schema definition
languages, and investigate properties such as expressive power and the
complexity of decision problems that are significant in the context of
data translation and integration.



1 Introduction 

The Web is emerging as a universal data repository, offering access to sources
whose data organization varies from strictly structured databases to almost
completely unstructured pages, and everything in between. Consequently, much
research has recently focused on data integration and data translation systems
[?], whose goals are to allow applications to utilize data
from many sources, with possibly widely varying formats. These research efforts
have established a common data model of semistructured data, for uniformly
representing data from any source. Recently, however, it is being realized that having
a common schema model is also beneficial, and even necessary, in translation and
integration systems to support tasks such as query formulation, decomposition
and optimization, or declarative specification of data translation.

As an example, which we use for motivation throughout the paper, recently
suggested tools for data translation [?] use the semistructured data model
as a common middleware data model to which data sources are mapped.
Translation from source to target formats is achieved by (1) importing data from
the source to the middleware model, (2) translating it to another middleware
representation that better fits the target structure, and then (3) exporting the
translated data to the target system. In [?], schema information is extensively
utilized in this process. Source and target formats are each represented
in a common schema language; then a correspondence between the source and
target schemas can be specified in the middleware model. In many cases, the
two schemas are rather similar, with differences mostly due to variations in their
data models; after all, they both represent the same data. Consequently, most
of the correspondence between the two schemas can be determined automatically,
with user intervention required only for endorsement or for dealing with
difficult cases. Once such a correspondence is established, translation becomes
an automatic process: Data from the source is imported to the common data
model and is "typed", i.e. matched to its schema, so that its components are
associated with appropriate schema elements. Next, the schema correspondence is
used for a translation of the data to the target structure, where each component
is translated according to its type. Then the data is exported to the target.

* The work is supported by the Israeli Ministry of Science and by the Academy of

The above idea was in fact implemented in the TranScm translation system,
whose architecture and functionalities are described in [?]. In the present
paper we elaborate on the theoretical foundations of the middleware schema
model. (The model has only been informally sketched in [?] and its properties
have not been investigated there.) While we refer to the TranScm system for
motivation, our results are relevant to other translation systems (e.g. YAT [?]), and
to other applications of semi-structured data such as data integration. Our main
contribution is the presentation of schema definition languages that are both
expressive and flexible, and an investigation of properties such as expressive power
and the complexity of decision problems that are significant in the context of data
translation and integration. Our schema languages are considerably more expressive
than those found in recent work on schematic constraints for semi-structured
data (e.g. [?]), providing a combination of expressibility and tractability.
Another contribution here is the extension of the data model to support order in
data, not supported by the common semistructured data model, both in the data
and in the schema level. The closest to our model is YAT [?], supporting order
and an expressive schema language, but lacking some of the features described
here. (They also do not mention expressiveness and complexity results.) Further
comparison is presented in the last section.

The paper is organized as follows. In Section 2 we present our data model
and two schema specification languages. We first introduce a basic schema
specification language and illustrate its flexibility and expressive power, then we
discuss an extension that is useful for data translation, and show it increases the
expressive power. Relationships between the languages and context-free/regular
grammars are established. The subsequent sections discuss, in order, the detection
and removal of redundancy from schemas, the closure of schemas under operations,
and the complexity of matching data to schemas and related problems. We then
present a preliminary study of the computation of schema information for queries
and its use for optimization. The final section summarizes the
results. All figures appear in the Appendix.

2 Data and Schemas

We define here the middleware data model, introduce our first schema spec- 
ification language and illustrate its capabilities. Then we explain an extension 
useful in data translation. Close relationships between the languages and context- 
free/regular grammars are established; these are extensively used in the sequel. 



2.1 The Data Model 

The common model for semi-structured data [?] represents data by trees
or forests, augmented with references to nodes. Thus, it is essentially that of
labeled directed graphs. This, with the significant addition of order, is also our
model. More differences and extensions are explained below.

Definition 1. Let $\mathcal{D}$ be an infinite set of label values. A data graph
$G = (V, E, L, O, \omega)$ is a finite directed node-labeled graph, where $V$ and $E \subseteq V \times V$
are finite sets of nodes and edges, respectively; $L : V \to \mathcal{D}$ is the labeling function
for the nodes; $O \subseteq V$ is the set of ordered nodes; and $\omega$ associates with each
node in $O$ a total order relation on its outgoing edges.

A rooted data graph is a graph as above, with a designated root node,
denoted $v_0$, and where each node in the graph is reachable from $v_0$.
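
One possible in-memory representation of Definition 1 is sketched below in Python. It is an illustration under stated assumptions rather than the representation used in the paper: the class name DataGraph and its field names are hypothetical, and the order $\omega$ is encoded simply by the order of each node's child list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DataGraph:
    labels: Dict[str, object]                       # L: node -> label value
    children: Dict[str, List[str]]                  # outgoing edges of each node
    ordered: Set[str] = field(default_factory=set)  # O: the ordered nodes
    root: str = ""                                  # v_0, the designated root

    def is_rooted(self) -> bool:
        """Check that every node is reachable from the designated root."""
        if self.root not in self.labels:
            return False
        seen = {self.root}
        stack = [self.root]
        while stack:
            v = stack.pop()
            # For ordered nodes the list order plays the role of omega;
            # for unordered nodes the list order carries no meaning.
            for w in self.children.get(v, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen == set(self.labels)


g = DataGraph(
    labels={"n0": "article", "n1": "authors", "n2": "S. Abiteboul"},
    children={"n0": ["n1"], "n1": ["n2"]},
    ordered={"n1"},
    root="n0",
)
print(g.is_rooted())
```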

Like in all models in the literature, node labels in our model can be used both to 
represent data such as integers and strings, and to represent schematic informa- 
tion, such as relation, class, and attribute names. While it might seem natural 
to represent schematic data as edge labels, and ‘real’ data as node labels, for 
convenience usually only one of the two kinds of labels is used. For our work, 
node labels seem to have an advantage. It is easy to convert edge labels to node 
labels: create a node to represent the edge, and move the label to it. Thus, our 
results apply to edge-labeled graphs as well. 
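
The conversion from edge labels to node labels described above can be sketched in a few lines of Python. The input encoding (a dictionary of node labels and a list of labeled edges) and the function name edge_to_node_labels are assumptions made for this illustration.

```python
from typing import Dict, Hashable, List, Tuple


def edge_to_node_labels(
    node_labels: Dict[Hashable, object],
    edges: List[Tuple[Hashable, object, Hashable]],
) -> Tuple[Dict[Hashable, object], List[Tuple[Hashable, Hashable]]]:
    """Replace each labeled edge (src, label, dst) by a new node carrying the label."""
    new_labels = dict(node_labels)
    new_edges = []
    for k, (src, label, dst) in enumerate(edges):
        mid = ("edge", k)          # fresh node representing the k-th edge
        new_labels[mid] = label    # the edge label moves to the new node
        new_edges.append((src, mid))
        new_edges.append((mid, dst))
    return new_labels, new_edges


labels, edges = edge_to_node_labels(
    {"r": None, "a": "Fred"},
    [("r", "name", "a")],
)
print(labels)
print(edges)
```

Each original edge becomes a path of length two through the new node, so the converted graph has one extra node and one extra edge per original edge.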

Allowing an order to be defined on the children of some of the nodes is a
significant extension, missing in most previous models. Order is inherent in data
structures such as tuples and lists, and is essential for representing textual data,
either as sequences of characters or words, or on a higher level by parse trees. As
shown in [2], supporting order in the data model enables a natural representation
of such data. By allowing order to be defined for only some of the nodes we are able
to naturally describe data containing both ordered and unordered components.

Reachability of data from one or more roots is a de-facto standard in essen- 
tially all data sources, from database systems to Web sites. In the sequel, we 
are mainly interested in rooted graphs; non-rooted graphs are used mainly for 
technical reasons. For brevity in the sequel a data graph will mean a rooted one 
(unless explicitly stated otherwise). For simplicity we assume here a single root, 
but all subsequent results can be naturally extended for the case of multiple 
roots.

2.2 Schemas

We now present the basic Schema Definition Language, ScmDL. A schema is
a set of type definitions. Following the SGML/XML syntax for element
definition [?], the definition of a type (schema element) has two parts. The first is
a regular expression describing the possible sequences of children that a node
can have. We use common notation for regular expressions: $\varepsilon$ denotes the empty
word, and $\cdot$ and $|$ denote concatenation and alternation resp. The second
part contains attributes that describe properties of a node. Three of these are
boolean flags, that determine if instance data nodes of this type (i) are ordered,
(ii) can be referenced from other nodes (only such nodes can have more than one
incoming edge), (iii) can be roots of the data graph. The fourth attribute is a
unary predicate, that determines the possible labels of data nodes of this type.
We do not use any specific predicate language. We assume given some collection
$\mathcal{P}$ of predicates over $\mathcal{D}$, the domain of label values, that is closed under the
boolean operations, and such that satisfiability of a predicate, and whether $p(d)$
holds, for a predicate $p$ and a label $d$, can be decided in time polynomial in the
size of $d$ and of $p$'s definition.¹ We also assume that: (a) For base types such as
Int, String, $\mathcal{P}$ contains a corresponding base predicate satisfied by the relevant
subset of $\mathcal{D}$. (b) For each constant $d \in \mathcal{D}$ there is a predicate satisfied only by
$d$ (i.e. $\lambda x.\, x = d$), denoted by is-d, where in the case of strings, the quotes are
omitted: e.g. is-foo denotes the predicate satisfied only by "foo". (c) Finally, to
capture cases where no restrictions are imposed on the labels, we assume that
$\mathcal{P}$ also contains a predicate, denoted Dom, that is satisfied by all the elements
in $\mathcal{D}$ (i.e. $\lambda x.\, true$).

Definition 2. Let $\mathcal{T}$ be an infinite set of type names. A type definition has
the form

$t = R_t\ (label = t_{label},\ ord = t_{ord},\ ref = t_{ref},\ root = t_{root})$

where $t \in \mathcal{T}$, $R_t$ is a regular expression over $\mathcal{T}$, $t_{label}$ is a predicate name in $\mathcal{P}$,
and $t_{ord}, t_{ref}, t_{root}$ are boolean values. (For additional conventions and
abbreviations, see below.)

A ScmDL schema is a pair $S = (T, Def)$, where $T \subseteq \mathcal{T}$ is a finite set of type
names, and $Def$ is a set of type definitions containing precisely one definition
for each of the type names in $T$ and using only type names in $T$.²

We use $Lang(t)$ to denote the regular language defined by $R_t$, and we use $t.a$,
for $a \in \{label, ord, root, ref\}$, to denote the value of the attribute $a$ in $t$'s
definition. We say that a type $t$ is ordered (is a root, is referencable) iff $t.ord$
($t.root$, $t.ref$, resp.) is true.

It can be seen that we follow the SGML/XML style of definition [?], as
shown in Figure 1, in that a type definition describes both the attributes of nodes,
and their children. We chose a less verbose style, for brevity.

¹ This assumption is used in deriving some of our complexity results; the consequences
of dropping it on these results are quite obvious.

² In [?] schemas are presented graphically, as the system is oriented towards a
graphical interface for defining schemas and representing schemas and data.

<!ELEMENT t (R_t) >

<!ATTLIST t
    label  t_label  #REQUIRED
    ord    Boolean  t_ord
    ref    Boolean  t_ref
    root   Boolean  t_root >



Fig. 1. SGML/XML-style type definition 



A ScmDL schema defines a set of data instances. Intuitively, a data graph $G$
conforms to a schema $S$ if each of the nodes $v \in G$ can be assigned a type $t$ s.t.
$v$ satisfies the requirements of $t$, as described in $t$'s definition. This is formally
defined below.

Definition 3. Let $G$ be a data graph and $S$ be a schema. We say that $G$
conforms to $S$ (or is an instance of $S$), if there is a type assignment on its nodes,
i.e. a total mapping $h$ from the nodes in $G$ to types in $S$ s.t. for each node $v \in G$
the following holds:

1. $v$'s label satisfies $h(v).label$,

2. $v$ is ordered iff $h(v)$ is ordered,

3. If $v$ is a root node then $h(v)$ is a root type,

4. If $v$ has more than one incoming edge then $h(v)$ is referencable,

5. Finally, if $v$ is ordered and its children are $v_1, \dots, v_n$ (in that order) then the
sequence $h(v_1), \dots, h(v_n)$ is a word in $Lang(h(v))$, and if $v$ is unordered then
there exists some order on its children for which the above holds.

The set of all data graph instances of a schema $S$ is denoted $inst(S)$, and we
say that two schemas $S, S'$ are equivalent if $inst(S) = inst(S')$.
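
The five conditions of Definition 3 can be checked directly once a candidate type assignment $h$ is given. The Python sketch below is a naive illustration under stated assumptions, not the matching algorithm studied later in the paper: the names TypeDef and conforms are hypothetical, each regular expression $R_t$ is written as a Python regex over tokens of the form "typename;", and unordered nodes are handled by trying every permutation of the children, which is exponential in the worst case.

```python
import re
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, Dict, List, Set


@dataclass
class TypeDef:
    content: str                     # R_t as a Python regex over tokens "typename;"
    label: Callable[[object], bool]  # the label predicate of the type
    ord: bool = False
    ref: bool = False
    root: bool = False


def conforms(labels: Dict[str, object], children: Dict[str, List[str]],
             ordered: Set[str], root: str,
             schema: Dict[str, TypeDef], h: Dict[str, str]) -> bool:
    """Check whether the type assignment h witnesses conformance to the schema."""
    indegree = {v: 0 for v in labels}
    for v in labels:
        for w in children.get(v, []):
            indegree[w] += 1
    for v in labels:
        t = schema[h[v]]
        if not t.label(labels[v]):              # condition 1
            return False
        if (v in ordered) != t.ord:             # condition 2
            return False
        if v == root and not t.root:            # condition 3
            return False
        if indegree[v] > 1 and not t.ref:       # condition 4
            return False
        child_types = [h[w] for w in children.get(v, [])]
        # Condition 5: the given order for ordered nodes, some order otherwise.
        orders = [child_types] if v in ordered else permutations(child_types)
        pattern = re.compile(t.content)
        if not any(pattern.fullmatch("".join(name + ";" for name in order))
                   for order in orders):
            return False
    return True


schema = {
    "article": TypeDef(r"title;authors;", lambda x: x == "article", root=True),
    "title": TypeDef(r"String;", lambda x: x == "title"),
    "authors": TypeDef(r"(?:author;)*", lambda x: x == "authors", ord=True),
    "author": TypeDef(r"String;", lambda x: x == "author", ref=True),
    "String": TypeDef(r"", lambda x: isinstance(x, str)),
}
labels = {"n0": "article", "n1": "title", "n2": "authors",
          "n3": "author", "n4": "Some title", "n5": "V. Christophides"}
children = {"n0": ["n2", "n1"], "n1": ["n4"], "n2": ["n3"], "n3": ["n5"]}
h = {"n0": "article", "n1": "title", "n2": "authors",
     "n3": "author", "n4": "String", "n5": "String"}
print(conforms(labels, children, {"n2"}, "n0", schema, h))
```

In the example the root is unordered, so its children are accepted although they are listed as authors before title; the authors node is ordered and is checked in the given order.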

Notational conventions: To further simplify type definitions, we assume the
convention that flags whose value is false are omitted, while those that are true
are represented by their name. Thus

$t = R_t\ (label = t_{label},\ ord)$

represents a type for which ord = true, ref = false, root = false. Another
convention is used for types of leaf nodes, i.e. nodes that have no outgoing edges,
hence their regular expression is $\varepsilon$. Almost always, for leaves all three flags are
false. Further, in essentially all cases, the labels of such nodes are values of base
types, such as Int, String, that have a corresponding predicate. Hence, from
now on we consider type definitions for such leaves as given, as in the following
definition,

$Int = \varepsilon\ (label = Int,\ ord = false,\ ref = false,\ root = false)$

and freely use Int, String, as type names in schemas without defining them.

2.3 Examples

Descriptive power is an obvious requirement of a middleware schema language.
One should be able to describe the strict and precise formats of database sources,
the variety of formats used in scientific disciplines, the very loose formats of Web
pages, and also cases where the data has parts that are precisely specified, and
other parts that are allowed to vary. We now illustrate the expressiveness and
flexibility of the language in describing possible structures of data.

An OODB schema and its instance database are shown in Figures 2 and
3, respectively. The corresponding ScmDL schema and data graph are shown
in Figures 4 and 5. Note that article is the root, authors is the only ordered
node, and that the data graph conforms to the schema, with the natural type
assignment.



class article public type
    tuple (title : string,
           authors : list(author),
           contact_author : author,
           sections : set(string))

class author : string

Fig. 2. OODB Schema



class name   oid   value
article      o0    tuple(title : "From Structured Documents to Novel Query Facilities",
                         authors : list(o1, o2, o3),
                         contact_author : o1,
                         sections : set("Structured documents ..."))
author       o1    V. Christophides
author       o2    S. Abiteboul
author       o3    S. Cluet


Fig. 3. An OODB Instance 



A DTD for an SGML document and its instance document are presented in
Figures 6 and 7, respectively. The corresponding ScmDL schema and data graph
are presented in Figures 8 and 9. All the internal nodes are ordered, with article
being the root.

The above were examples for homogeneous, strictly structured data. The
schema in Figure 10 illustrates the opposite case, where no constraints at all

article        = title · authors · contact_author · sections   (label = is-article, root, ref)
title          = String    (label = is-title)
authors        = author*   (label = is-authors, ord)
author         = String    (label = is-author, ref)
contact_author = String    (label = is-contact_author)
sections       = String*   (label = is-sections)

Fig. 4. ScmDL schema for the OO data graph




Fig. 5. Data graph of an OODB article 



<\DOCTYPE article [ 
<\ELEMENT article 
<\ELEMENT title 
<\ELEMENT author 
<\ELEMENT contact-author 
<\ELEMENT section 



(title, author* , contact -author, section*) > 
(if PCDATA) > 

(UPCDATA) > 

(if PCDATA) > 

(if PCDATA) > 



Fig. 6. SGML DTD 



<article>
<title> From Structured Documents to Novel Query Facilities </title>
<author> V. Christophides </author>
<author> S. Abiteboul </author>
<author> S. Cluet </author>
<contact_author> V. Christophides </contact_author>
<section> Structured documents are central... </section>
. . . </article>

Fig. 7. SGML Document

article        = title . author* . contact_author . section*   (label = is-article, ord, root)
title          = String   (label = is-title, ord)
author         = String   (label = is-author, ord)
contact_author = String   (label = is-contact_author, ord)
section        = String   (label = is-section, ord)

Fig. 8. ScmDL schema of an SGML article
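
The role of the regular expressions in such a schema can be illustrated with a minimal Python sketch. It is not part of the original formalism: it only performs the local check that the ordered children of a node form a word in the language of the node's type, assuming every child has already been assigned a type name. Labels, roots and references are ignored, and the name `children_match` and the encoding of the expressions as Python `re` patterns are assumptions made for illustration.

```python
import re

# Regular expressions of the Fig. 8 schema, written over type names.
# Every type name is followed by one space so that names cannot run together.
SCHEMA = {
    "article": r"title (author )*contact_author (section )*",
    "title": r"String ",
    "author": r"String ",
    "contact_author": r"String ",
    "section": r"String ",
}

def children_match(type_name, child_types):
    """Return True if the ordered sequence of child types is in Lang(type_name)."""
    word = "".join(t + " " for t in child_types)
    return re.fullmatch(SCHEMA[type_name], word) is not None

# Children of the article node of the SGML document, in document order.
ok = children_match(
    "article",
    ["title", "author", "author", "author", "contact_author", "section"],
)
# An author after the contact author violates the expression for article.
bad = children_match("article", ["title", "contact_author", "author"])
```

A full conformity test would also have to find the type assignment itself and check the label, ord, root and ref attributes; this sketch covers only the regular-expression condition for ordered nodes.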



Fig. 9. Data graph of an SGML article



are imposed on data graphs. (Recall that Dom is the label predicate λx.true.)
It is easy to see that all data graphs conform to this schema with a type
assignment h assigning any_ord to all the ordered vertices and any_unord to the
unordered ones.

any_ord   = (any_ord | any_unord)*   (label = Dom, ord, root, ref)
any_unord = (any_ord | any_unord)*   (label = Dom, root, ref)

Fig. 10. Schema for arbitrary graphs



Finally, we present an example for the modeling of partial constraints. As- 
sume that we are modeling a web site with arbitrary structure, but where one 
of the links leads to an OODB describing articles, with structure as described 
above. The schema for such mixed data is presented in Figure 11. (W.l.o.g. we
assume that all the nodes representing Web pages are ordered, reflecting the 
order of elements in the page.) 

Note that in [3], schema information was captured using schema-graphs, i.e.,
graphs whose nodes represent types and whose edges describe the possible
children that a node of a certain type may have. We argue that ScmDL is powerful
enough to express all the schema constraints expressible by the schema-graphs

mixed   = any* . (db | mixed) . any*   (label = Dom, ord, root, ref)
any     = any*                         (label = Dom, ord, ref)
db      = article*                     (label = Dom)
article = defined as in Figure 4       (...)

Fig. 11. Schema for a Web site containing an articles OODB



of [3] and much more.³ We can require a node to have a certain outgoing edge,
while schema-graphs only allow stating that such an edge is possible. We also
deal with order and referencability constraints, while they do not. The proof of
the following proposition is omitted for space reasons.

Proposition 1. Every schema-graph has an equivalent ScmDL schema, but not
vice versa.

2.4 Virtual Nodes and Types 

Typically, in a translation scenario, the target schema is different from that of 
the source. One common difference is that in the source the grouping of data 
into subtrees/subgraphs is less (or more) refined than in the target. For example, 
assume we want to convert the SGML document in Figure 7 into an OO format,
as in Figure 3. In both formats an article consists of four logical components,
namely the title, the authors list, the contact author, and the sections. However,
while in the OO data graph (Figure 5) each of these parts occurs in a subtree
under an appropriately named node, this does not hold in the data graph of
the SGML document (Figure 9), which is just a sequence of elements. Note
that the refined structure (implicitly) exists in the parse tree of the document,
relative to the regular expression defining it. To facilitate the translation, it
is convenient to make the logical structure explicit and split the sequence into
its logical sub-components. That is, rather than looking at the “real” SGML
instance in Figure 9, it is convenient to use a “virtual” graph with structure as
in Figure 12, remembering that the authors and sections are “virtual” vertices
that do not really occur in the given data.

The first step is to define virtual graphs and schemas, and extend the notion 
of conformity accordingly. Observe that, since virtual nodes represent logical 
components that do not actually occur in the data, they cannot serve as
roots or be referenced (i.e. have more than one incoming edge). Also, since virtual
nodes are introduced to represent subsets or subsequences of the children of a
given node, if the node is (un)ordered so should be the virtual types used in its
definition.

Definition 4. A virtual data graph is a data graph with some nodes marked
as virtual. A virtual schema is a schema as defined earlier, with one more boolean
attribute, virtual, in the type definitions, such that:

³ As already mentioned, the fact that they use edge labels while we use node labels
does not prevent a meaningful comparison.

Fig. 12. Virtual version of the data graph of an SGML article



(i) a virtual type is neither referencable nor a root,

(ii) if a virtual type t_v occurs in the regular expression associated with a type t,
then t and t_v are both ordered or both unordered.

The extended schema definition language is denoted VScmDL. A virtual graph
G_v v-conforms to a VScmDL schema S if there exists a type assignment h
satisfying all the conditions of conformity given earlier and, additionally, a node v
is virtual iff h(v) is.

Note that by the definition, a virtual node in an instance data graph is not the 
root, is not referencable, and is ordered iff its parent (if one exists) is. Hence we 
will consider in the sequel only virtual graphs where all the virtual nodes have 
this property. We refer to data graphs without virtual nodes, to types whose 
virtual flag is false, and to schemas that contain only such types (equivalently 
ScmDL schemas) as real. 

We can now state the relationship between real data graphs and virtual 
graphs and schemas. 

Definition 5. Let G and G_v be a real and a virtual data graph, resp. We say
that G_v is a (possible) virtual version of G if G can be obtained from G_v
by identifying each virtual node with its closest non-virtual ancestor, gluing its
children directly to this ancestor, preserving the order of the children, and using
the label of the ancestor for the resulting node.
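
The gluing operation of Definition 5 can be sketched in Python as follows. This is an illustrative sketch only, restricted to trees (no shared or referenced nodes); the `Node` class and the function names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    virtual: bool = False

def real_children(node):
    """Children of `node` after splicing out virtual descendants, in order."""
    result = []
    for child in node.children:
        if child.virtual:
            # A virtual child is identified with its closest non-virtual
            # ancestor, so its children are glued to that ancestor in place.
            result.extend(real_children(child))
        else:
            result.append(collapse(child))
    return result

def collapse(node):
    """Return the real tree of which the tree rooted at the non-virtual `node` is a virtual version."""
    return Node(node.label, real_children(node), virtual=False)

# A virtual tree shaped like the virtual version of the SGML article.
virtual_article = Node("article", [
    Node("title"),
    Node("authors", [Node("author"), Node("author"), Node("author")], virtual=True),
    Node("contact_author"),
    Node("sections", [Node("section")], virtual=True),
])
real_article = collapse(virtual_article)
real_labels = [child.label for child in real_article.children]
```

Here `real_labels` lists the children of the collapsed article node in their original order, with the two virtual grouping nodes removed.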

Definition 6 (conformity - revisited). A real data graph G conforms to a
VScmDL schema S if some virtual version G_v of G v-conforms to S by some
type assignment h. This h is called a virtual type assignment for G; its
restriction to nodes of G is called a type assignment for G.

The set of virtual graphs that v-conform to S is denoted v-inst(S), the set
of real graphs that conform to it is denoted inst(S); two schemas S, S' are (v-)
equivalent if inst(S) = inst(S') (resp. v-inst(S) = v-inst(S')).

Continuing with the above example, the virtual graph in Figure 12 is a possible
virtual version of the data graph in Figure 9. Thus the SGML article of
Figure 7 conforms to the following virtual schema due to this virtual version,
with the natural (virtual) type assignment.

article  = title . authors . contact_author . sections   (label = is-article, ord, root)
authors  = author*    (label = is-authors, ord, virtual)
sections = section*   (label = is-sections, ord, virtual)

Note that if a VScmDL schema has no virtual types (i.e. is a real schema, or
equivalently, a ScmDL schema) then conformity and v-conformity coincide, and so do
inst(S) and v-inst(S).

Recalling the above SGML-to-OODB translation process, we can see that the 
introduction of virtual types facilitates the comparison of the source and target 
schemas, and the correspondence between their types. (For additional illustra- 
tions and translation examples see m-) Similarly, the introduction of virtual 
nodes into the source data facilitates its matching to its schema augmented with 
these virtual types, and hence the translation of the data. Finding where such 
virtual nodes are needed is part of the conformity test. The complexity of this
and related problems is discussed in the following sections. Observe that, in
general, a data graph may have more than one possible virtual version and (vir- 
tual) type assignment. We shall also consider this issue in the sequel and present 
conditions under which a unique type assignment is guaranteed. 

2.5 Schemas and Grammars 

Real schemas associate types with regular grammars. It turns out that the
extension of schemas to include virtual types, with the refined notion of conformity,
extends the expressive power of the language. Denote by CF-ScmDL a variant of
ScmDL, also without virtual types, where types can be described by context-free
grammars, rather than by regular expressions, and similarly use VCF-ScmDL
for such schemas with virtual types. The semantics is defined exactly as for ScmDL
and VScmDL schemas, except that now Lang(t) is in general a context-free language.

Proposition 2. Every VScmDL schema has an equivalent CF-ScmDL schema.
Conversely, every CF-ScmDL schema has an equivalent VScmDL schema, but
not necessarily an equivalent ScmDL schema.

The intuition here is based on the analogy between virtual nodes of a graph 
and internal nodes of derivation trees for CF grammars. Thus the extension of 
ScmDL to VScmDL adds expressive power. In contrast, enriching CF-ScmDL 
with virtual types does not extend the expressive power. 
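
A small worked example may help to see why virtual types add expressive power. The sketch below is an illustration under stated assumptions, not a construction taken from the paper: it assumes a hypothetical schema with a real type r = v and a virtual, ordered type v = (a . v . b) | ε, and it assumes that a virtual type may occur in its own regular expression. Splicing out the virtual nodes then leaves r with children of the form a^n b^n, which no single regular expression over type names can describe.

```python
# Hypothetical VScmDL-style schema: r = v, where v = (a . v . b) | epsilon is virtual.
RULES = {"v": [["a", "v", "b"], []]}
VIRTUAL = {"v"}

def expand(word, depth):
    """Sequences of real types obtained from `word` by splicing out virtual
    types, using at most `depth` unfolding steps."""
    position = next((k for k, s in enumerate(word) if s in VIRTUAL), None)
    if position is None:
        return {tuple(word)}
    if depth == 0:
        return set()
    results = set()
    for alternative in RULES[word[position]]:
        unfolded = word[:position] + alternative + word[position + 1:]
        results |= expand(unfolded, depth - 1)
    return results

# Child sequences of r reachable with at most three unfoldings of v.
children_of_r = expand(["v"], 3)
```

The unfolding steps play the role of derivation steps of a context-free grammar, which is the analogy behind Proposition 2.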

Proposition 3. Every VCF-ScmDL schema has an equivalent CF-ScmDL
schema. Moreover, every VCF-ScmDL schema has an equivalent CF-ScmDL
schema where all unordered types are associated with regular expressions.

An immediate consequence of the second part of the Proposition (proved using
Parikh’s theorem [T^l) is that Proposition 2 can now be refined: “Every VScmDL
schema has an equivalent CF-ScmDL schema where all unordered types are associated
with regular expressions”. Thus the use of virtual types adds expressive power
only in describing ordered types.
3 Redundant Types

Since schemas allow the description of optional data, a type assignment for a 
given data graph may not use all types in a schema, and type assignments to 
different graphs may use distinct types. However, it may be the case that a 
schema contains some types that are redundant — they are not used by any 
type assignment. For example, consider the following schema. 

r = t | q   (label = Dom, root)
q = ε       (label = Dom)
t = t       (label = Dom)

It is easy to see that it has only one possible instance, having a root of type r
with one child of type q. The other alternative, namely having a child of type
t, is impossible since t requires a child of type t, and so on ad infinitum. Such
an infinite chain of nodes of type t can exist in a finite graph only in a cycle in 
which an edge points back to a previous t node in the chain. But nodes of type 
t are not referencable, hence this is not allowed. It follows that t is redundant. 

Formally, a type t is redundant in a schema S if there is no instance G 
of S having a virtual type assignment that uses t. Clearly one would like to 
avoid having redundant types in a schema - they make the schema larger than 
necessary, hence more complex to understand and handle. We prove that type 
redundancy can be treated and eliminated efficiently, hence from now we assume 
that schemas contain no redundant types. 

Theorem 1. Every (V)ScmDL schema S can be transformed to an equivalent
(V)ScmDL schema S' with no redundant types, in time polynomial in the size
of S.

Corollary 1. Given a schema S, it is possible to decide if inst(S) is empty, in
time polynomial in the size of S. 
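
The proof of Theorem 1 is not given here, so the following Python sketch should be read as a simplified illustration in the spirit of the emptiness test for context-free grammars, not as the authors' algorithm. It assumes that no type is referencable (so cycles cannot rescue an infinite chain) and that each regular expression has been pre-normalized by hand into a list of alternatives, each alternative being the set of types that must occur at least once; starred subexpressions contribute nothing. The name `productive_types` and this representation are hypothetical.

```python
def productive_types(required):
    """Types that can head a finite subtree, computed as a least fixpoint."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for type_name, alternatives in required.items():
            if type_name in productive:
                continue
            # A type is productive if some alternative needs only productive types.
            if any(alternative <= productive for alternative in alternatives):
                productive.add(type_name)
                changed = True
    return productive

# The example schema of this section: r = t | q, q = epsilon, t = t.
required = {
    "r": [{"t"}, {"q"}],
    "q": [set()],
    "t": [{"t"}],
}
usable = productive_types(required)
redundant = set(required) - usable
```

On the example, t is the only type left out of the fixpoint, matching the argument in the text. Under the same assumptions, the schema has no instance exactly when no root type is productive. A complete treatment would also have to discard types that are unreachable from any root and to account for referencable types.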

4 Closure Properties 

In data integration one imports data from multiple sources, each described by its 
own schema, and it is often desirable to describe all data using a single schema. 
For simple scenarios, this requires performing on schemas operations that re- 
flect the basic set theoretic operations on the data. We consider the following 
two questions: Given two VScmDL schemas S1, S2, does there always exist a
VScmDL schema S_op, such that

1. inst(S_op) = inst(S1) op inst(S2), where op is one of ∪, ∩, − ?

2. v-inst(S_op) = v-inst(S1) op v-inst(S2), for op in ∪, ∩, − ?

We will refer to the first as closure under op and to the latter as closure under
virtual op. We note that closure under (virtual) difference implies closure under
(virtual) complement since we have shown that ScmDL can define a schema with
all data graphs as its instances (and a similar VScmDL schema can be defined for
virtual instances). Also, since by Corollary 1 one can test emptiness of schemas,
computable closure under difference implies that testing for the equivalence and
the containment of schemas (hence schema subsumption) is possible.

Proposition 4.

(1) VScmDL schemas are closed under union, but not under intersection,
subtraction and complement. If one of the two schemas is real, then they are closed
under intersection.

(2) ScmDL schemas (i.e. schemas with no virtual types) are closed under all
these operations.

(3) Finally, all schemas are closed under virtual union, intersection, subtraction
and complement.

Note that here we need our predicate language to be closed under boolean
operations. Also, observe that for ordered types the results follow from closure 
properties of context-free and regular languages, but for unordered types we use 
the theory of semi-linear sets m 

Initial results regarding schema closure under more general query operations 
are presented below. 

5 Matching Data and Schema 

Given a (V)ScmDL schema S, we call the problem of testing if a data graph G 
conforms to it the (v-)matching problem. Finding a corresponding virtual version 
(for the case of VScmDL schemas) and the (virtual) type assignment, if one 
exists, is called the type derivation problem. We are interested in the complexity 
of the problems as a function of the size of the input data graph G. 

Theorem 2. The matching, v-matching and type derivation problems are NP- 
complete. The problems remain NP-complete even if the nodes in the graph are 
all ordered (or all unordered). 

Proof (sketch). For the upper bound, guess a type assignment for the nodes.
Testing the local properties for each node is easy. For the relationship with its
children, convert the VScmDL schema into a CF-ScmDL schema. Then, for each node, use
the parse tree of the word for deriving the virtual nodes and their assigned types.
(For real schemas, this step is simpler.) Hardness is proved via a reduction from
satisfiability of 3CNF formulas.

Despite this result, in many practical cases the above problems can be solved 
in polynomial time. We present some examples below. 

Proposition 5. If the data graph is a tree then the (v-)matching and type
derivation problems for it can be solved in polynomial time.

Proof (sketch). We determine all possible type assignments for each node, going
up from the leaves. For unordered nodes we use dynamic programming to avoid
considering all the possible orders of the edges.
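
The bottom-up idea can be sketched in Python for the simplest case: a real schema, an ordered tree, and regular expressions given as deterministic automata over type names. All of these choices are assumptions made for illustration, as are the function name and the tuple layout `(label_predicate, start, finals, delta)`; unordered nodes, virtual types and the root and ref attributes are not handled.

```python
def possible_types(node, schema):
    """Set of types assignable to `node`, an ordered tree given as (label, [children])."""
    label, children = node
    child_candidates = [possible_types(child, schema) for child in children]
    result = set()
    for type_name, (label_ok, start, finals, delta) in schema.items():
        if not label_ok(label):
            continue
        # Run the automaton on the children, tracking every state reachable
        # under some choice of one candidate type per child.
        states = {start}
        for candidates in child_candidates:
            states = {delta[(s, t)] for s in states for t in candidates
                      if (s, t) in delta}
        if states & finals:
            result.add(type_name)
    return result

# Hypothetical schema: doc = (para | quote)*, where para and quote are leaf
# types that both accept the label "text".
SCHEMA = {
    "doc": (lambda label: label == "doc", 0, {0},
            {(0, "para"): 0, (0, "quote"): 0}),
    "para": (lambda label: label == "text", 0, {0}, {}),
    "quote": (lambda label: label == "text", 0, {0}, {}),
}

tree = ("doc", [("text", []), ("text", [])])
root_types = possible_types(tree, SCHEMA)
leaf_types = possible_types(("text", []), SCHEMA)
```

Because the subtrees of a tree are disjoint, the candidate types of different children can be chosen independently, which is what keeps the state-set simulation polynomial. The example also shows why sets of candidates are kept: each leaf admits two types.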
Non-tree data can also be handled efficiently in some cases. One case is the
class of schema-graphs of [3], which, as shown in the proof of Proposition 1,
have a very simple syntactic characterization in ScmDL. Following [3] we have
that the matching and the type derivation problems for this class can be solved
in polynomial time (details omitted). Next, we consider in more detail another
case:

Definition 7. We say that a CF-ScmDL schema S is tagged if for every
type name t in S, for any two types t1, t2 in Lang(t), their label predicates are
disjoint (i.e. cannot be satisfied by the same label value).

A VScmDL schema is tagged if its equivalent CF-ScmDL schema S', as
constructed in the proof of Proposition 2, is tagged.

Observe that relational and object oriented databases, and in general strongly 
typed databases, with homogeneous bulk types and tagged union are all naturally 
described by tagged (real) schemas. 

Proposition 6. For tagged schemas the (v-)matching and the type inference 
problems can be solved in polynomial time. 

Proof (sketch). Here the algorithm works from the root down to the leaves.
Again, unordered nodes pose a difficulty, since one needs to avoid considering 
all the possible orders for the set of children, which is addressed using dynamic 
programming. 
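
For ordered trees, the top-down strategy can be sketched as below. This is a simplified illustration and not the algorithm of the proof: it assumes a real tagged schema whose regular expressions are given as deterministic automata over type names, it identifies nodes by their path from the root, and it ignores unordered nodes and references. The names `assign`, `leaf` and the tuple layout are hypothetical.

```python
def assign(node, type_name, schema, path=()):
    """Top-down type assignment; returns {path: type} or None if matching fails."""
    label, children = node
    label_ok, start, finals, delta = schema[type_name]
    if not label_ok(label):
        return None
    assignment = {path: type_name}
    state = start
    # Types that may occur among the children of a node of this type.
    allowed = {t for (_, t) in delta}
    for i, child in enumerate(children):
        # Tagging: at most one allowed type accepts the child's label.
        matching = [t for t in allowed if schema[t][0](child[0])]
        if len(matching) != 1 or (state, matching[0]) not in delta:
            return None
        state = delta[(state, matching[0])]
        sub = assign(child, matching[0], schema, path + (i,))
        if sub is None:
            return None
        assignment.update(sub)
    return assignment if state in finals else None

def leaf(name):
    # Hypothetical helper: a leaf type with the given label and no children.
    return (lambda label: label == name, 0, {0}, {})

SCHEMA = {
    "article": (lambda label: label == "article", 0, {2},
                {(0, "title"): 1, (1, "author"): 1,
                 (1, "contact_author"): 2, (2, "section"): 2}),
    "title": leaf("title"),
    "author": leaf("author"),
    "contact_author": leaf("contact_author"),
    "section": leaf("section"),
}

doc = ("article", [("title", []), ("author", []),
                   ("contact_author", []), ("section", [])])
typing = assign(doc, "article", SCHEMA)
```

Since the label of each child selects its type, no guessing is needed and the assignment, when it exists, is found in one pass from the root.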

Conformity of a data graph to a schema implies there is at least one type 
assignment h for its nodes, but there may exist more than one assignment. From 
the proof of Proposition 6 we have:

Proposition 7. Instances of tagged real schemas have a single type assignment. 

Given a VScmDL schema S, we call the problem of testing if every instance
of S has a unique (virtual) type assignment, the unique (virtual) assignment 
problem. Observe that having a unique virtual assignment always implies having 
a unique type assignment, but not vice versa. 

Theorem 3. 

(1) The unique (virtual) assignment problem is in general undecidable. 

(2) The problem is decidable for real schemas. 

Proof (sketch). We prove (1) by reductions from the problem of deciding the
emptiness of the intersection of two context-free languages (for unique assignment),
and of testing ambiguity of context-free languages (for unique virtual
assignment). The proof for (2) involves intricate analysis of the possible causes for
non-unique typing in real schemas.

6 Schemas and Queries

Often in data integration or translation one may be interested in only a part of 
the data in a source, defined by a query. For example, a query in an integration 
system is broken into components that are pushed to the sources; only the corre- 
sponding results are translated and integrated. A natural issue that arises here 
is to derive the schema of a query result. This can, e.g., facilitate the translation 
of the retrieved data. In the opposite direction, query evaluation on a source can 
benefit by using schema information to prune the search. This well known idea 
of classical database systems is attracting attention in systems that manipulate 
heterogeneous and semi-structured data. In this direction, one would
like to derive schema information for node variables in a query, and use it to
restrict the search. 

Query languages for semi-structured data, like their classical ancestors, have
a body and a head fBl [Ql [D]. In the body, node variables are introduced,
operated upon by predicates, and related by edges and paths. We consider here
the component that seems to be the more difficult to treat, for deriving schema
information, namely generalized path expressions fTTurra FTieiT^ of the form
x_0 P_1 x_1 P_2 x_2 ... P_n x_n, where the x_i's are variable names, and the P_i's are
regular expressions or path variables. Intuitively, given a data graph G, such
expressions search for nodes v_0, ..., v_n s.t. the path between v_{i-1} and v_i, i =
1...n, matches P_i (and v_0 is a root node). Determining the possible types for
the v_i's can help both in determining the schema of the query result and in
pruning the search.

Formally, we consider path expressions of the form P = x_0 R_1 x_1 ... R_n x_n
where the x_i's are distinct variable names, and the R_i's are regular expressions
over predicates in 𝒫.⁴ (The case where some of the variables may be identical will
be considered later.) Given a data graph G and a path p = v_0 → v_1 → ... → v_k
in G, where the label of v_i is l_i, for i = 0, ..., k, we say that p matches R_i iff
the language defined by R_i contains a word w = p_0 ... p_k s.t. p_i(l_i) holds, for all
i = 0, ..., k. We say that the nodes v_0, v_1, ..., v_n that lie on a path in a data
graph G satisfy the generalized path expression P = x_0 R_1 x_1 ... R_n x_n, if v_0
is a root node, and for all i = 1...n, there exists a path from v_{i-1} to v_i that
matches R_i.
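
The matching condition for a single path can be sketched in Python as follows. The sketch assumes that the regular expression over predicates has already been compiled, by hand here, into a nondeterministic automaton without ε-transitions whose edges carry predicates; the function name `path_matches` and the triple `(start, finals, transitions)` are hypothetical.

```python
def path_matches(labels, nfa):
    """True if some word p_0 ... p_k accepted by `nfa` satisfies p_i(labels[i]) for all i."""
    start, finals, transitions = nfa
    states = {start}
    for label in labels:
        # Follow every transition whose predicate holds on the current label.
        states = {target for (source, predicate, target) in transitions
                  if source in states and predicate(label)}
    return bool(states & finals)

def is_article(label):
    return label == "article"

def is_author(label):
    return label == "author"

def dom(label):
    # The label predicate that accepts every label.
    return True

# Automaton for the expression is-article . Dom* . is-author.
NFA = (0, {2}, [(0, is_article, 1), (1, dom, 1), (1, is_author, 2)])

hit = path_matches(["article", "authors", "author"], NFA)
miss = path_matches(["article", "authors"], NFA)
```

The set of automaton states is updated once per node on the path, so the test takes time proportional to the path length times the number of transitions.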

Now, we say that the vector of types t_0, t_1, ..., t_n in a VScmDL schema S is
a possible type assignment for x_0, ..., x_n, respectively, if there exists a data
graph G that conforms to S w.r.t. a type assignment h, and nodes v_0, v_1, ..., v_n ∈
G satisfying P, s.t. h(v_i) = t_i, i = 0...n. Each such possible type assignment
describes the types of objects in some occurrence of the path expression. Given
a schema S and a generalized path expression P, our goal is to find the set of
all possible type assignments for the variables in P. We note that the size of the
answer may be large, compared to the schema: if the schema is very loose, it may
be the case that each of the variables can be associated with most of the types

⁴ Note that these regular expressions are different from the ones used in our schema
language, which are defined over types in T and not over predicates.
in the schema, so the size of the answer can be exponential
in the size of the schema. So rather than measuring the complexity of the type 
computation only in terms of the size of the schema and the query, it is useful 
to also take the size of the answer into consideration. 

Theorem 4. Given a schema S and a path expression P, the possible type
assignments for the variables in P can be computed in time polynomial in the size
of S, P, and the number of possible type assignments +1.⁵

Proof (sketch). To prove the theorem we first show that for every two types
t', t'' in S and every regular expression R_i, it is possible to define (in polynomial
time) a regular language describing all the possible paths that start at a node
of type t', end at a node of type t'', and match R_i. Then we use these languages
to compute the possible type assignments.

Next, we consider how computing the possible types of query variables may 
be useful for optimizing the query evaluation, by allowing to prune the search 
space. 

Definition 8. Given a schema S, two types t', t'' in S, and a regular expression
R over 𝒫, we say that a type t in S is useful w.r.t. t', t'', R, if there exists a
data graph G that conforms to S with a type assignment h, and a path p = v_0 →
v_1 → ... → v_n in G that matches R, s.t. h(v_0) = t', h(v_n) = t'', and h(v_i) = t
for some 0 < i < n.

Theorem 5. For every t', t'', R, the set of useful types w.r.t. t', t'', R can be
computed in time polynomial in the size of S and R.

At each step in the computation of a generalized path expression P, one holds
some node v_i (initially the root node v_0) and searches for paths that match a
path expression R_{i+1}, looking for the nodes v_{i+1} at the end of such a path. If we
know the type assignment for nodes, and in particular the type t' of v_i, then we
can prune the search whenever we run into a node of type t that is not useful
w.r.t. t', R_{i+1}, and any of the types t'' in S.

Moreover, if we pre-compute the possible type assignments for P, then we
can do even better: Given the type of a current node (and the types of the nodes 
assigned to previous variables in P), we can check what are the possible type 
assignments for the next variable, and prune all paths going through non-useful 
types w.r.t these possible assignments. 

Remark: When variables may occur several times in a path, we can remove 
from the type assignments all those vectors in which different occurrences of the 
same variable have been assigned distinct types. Note however that this still does 
not guarantee that all the remaining vectors are actually possible assignments, 
i.e. that there exists an instance with a path where the two occurrences of the 
variable have been assigned the same node, and not just the same type. Selecting 
the appropriate vectors in this case is still an open problem. 



⁵ The “+1” is for the extreme case where there aren’t any.

7 Conclusion

We have presented two schema definition languages, and illustrated their ex- 
pressive power. In particular, we presented motivation for augmenting the basic 
language with the notions of virtual nodes and types, and showed these extend 
the expressive power. We also investigated the complexity of significant problems 
regarding schemas and instances, showing that in many cases they can be solved 
efficiently. Hence, we believe that our schema languages, particularly ScmDL, 
present a good combination of expressibility and tractability, and that our ap- 
proach and results can be used to enhance the usability of data translation and 
integration systems, as illustrated in the TranScm system. 

As we mentioned, there has been recent work on schematic description of 
semistructured data 131111113. However, these works focus more on describing 
patterns of paths in the data than on precisely describing its structure. For 
example, they can specify that edges with a given property going out of a node 
may exist, but they cannot require their occurrence, which can easily be done in 
our language. Thus, our work is much closer to classical approaches to schemas 
and types. 

A significant recent work that presents a comparable approach is YAT [TTj . 
They have order in their data and schemas, (but there all the nodes are ordered, 
which is less natural when modeling unordered components like sets). More 
importantly, it can be seen that their schema language essentially uses regular 
expressions, and is thus similar to ScmDL, but they do not support virtual 
types. While their instantiation mechanism provides a notion of matching data
to a schema, it does not provide an explicit type assignment to instance nodes 
as our formalism does. But it offers subtyping that we do not treat here. Their 
notion seems to differ from ours in some subtle points, e.g., in the treatment of 
the combination of *, | in regular expressions. Such differences may impact some 
of the complexity results we have obtained for our model. While they mention 
matching of data to schemas, and inferring the schemas for queries (although 
without regular paths), they do not mention complexity results. Further study is 
needed to clarify the differences, and to investigate the impact on the complexity 
of significant problems. 
Abstract. We study the problem of rediscovering the schema of nested
relations that have been encoded as strings for storage purposes. We
consider various classes of encoding functions, and consider the mark-
up encodings, which allow finding the schema without knowledge of the
encoding function, under reasonable assumptions on the input data. De- 
pending upon the encoding of empty sets, we propose two polynomial 
on-line algorithms (with different buffer size) solving the schema finding 
problem. We also prove that with a high probability, both algorithms 
find the schema after examining a fixed number of tuples, thus leading 
in practice to a linear time behavior with respect to the database size for 
wrapping the data. Finally, we show that the proposed techniques are 
well-suited for practical applications, such as structuring and wrapping 
HTML pages and Web sites. 



1 Introduction 

Databases traditionally provide consolidated technologies for organizing and ma- 
nipulating structured data. However, in recent years, alternative approaches to 
data organization have imposed themselves as de facto standards. A prominent 
driving force behind this has been the World Wide Web, which delivers infor- 
mation in less structured form, essentially as hypertextual documents. 

It has been argued in several contexts that document management still lacks 
the flexibility offered by database systems. Therefore, a number of recent pro- 
posals aim at extending database-like techniques to textual data coming from 
the Web. In most cases, in fact, Web documents have a rather tight structure 
that can be assimilated to a database type. For example, a large number of or- 
ganizations are now providing a Web access to their information systems and 
databases. Several commercial packages now support the activity of publishing 
in HTML format data coming from a DBMS. In this process, a collection of 
database objects, such as tuples corresponding to employees,
are somehow “encoded” into HTML pages, to be browsed by users. Of course, 
these generated pages clearly reflect the regularities of the initial database. 

In many cases, users do not have access to the original database: All they 
can access are the HTML files on the site. Therefore, in order to be able to
apply database-like query techniques to Web pages, a key problem is to identify 
relevant pieces of information inside text, and to extract them in order to build 
an internal representation in some data model, which can then be manipulated 
by a query language. This process, which is usually referred to as a wrapping 
of a data source, can be seen as an attempt to “decode” the original database 
objects that have been encoded as HTML files for publishing purposes.

In this paper, we formally investigate exact methods for this Schema Finding 
Problem (SFP), which can be informally stated as follows: Given a collection of 
strings encoding instances of some database type, derive the type and a database 
representation of the original instances. 



1.1 The Framework 



To show a practical example of a schema finding problem, consider the Nagano 
Winter Olympics Web site |JN a^| : The site contains data about athletes and 
events at the last Winter Olympics, and has been implemented using an object- 
relational database as a back-end |l;a,sfl8| . 
Fig. 1. Athletes at the Winter Olympics



Coming from a database, pages in the site have a rather tight structure. For
example, two pages from the site are shown in Figure 1; they report data about
two athletes, Manuela di Centa and Ian Piccard. For each athlete, the name, the
country, and a list of performances are reported.¹ Note that the site contains

¹ The page also contains several links, like, for example, a link to a biography, to the
country page, and to each competition page. To simplify the discussion, we ignore 
a page of this kind for each competing athlete. We can abstract the content of
such pages as a counterpart to database objects of a nested type athlete, with 
attributes name and country of type string, plus an attribute performances, a 
collection of tuples, each reporting the event, placement and performance, as 
follows: 

type athlete ( name: string;
               country: string;
               performances: set ( event: string;
                                   placement: string;
                                   performance: string; ) )

Given such a precise structure, it would be reasonable to think of using some 
high-level tool to ask complex queries, like “Find all athletes in the site that 
won at least two medals”. Unfortunately, we can’t directly query the underlying 
database: the only way of accessing the site is by downloading raw-text HTML 
pages through the network and inspecting their content. We have no access to 
the original objects in the database, but rather to their HTML-encoded versions. 
This, of course, complicates the query process. 
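
To make the contrast concrete, the following Python sketch shows how simple the quoted query becomes once objects of the nested type athlete are available. It is an illustration only: the class and function names are hypothetical, the performances are held in a list rather than a set, the sample records are placeholders rather than data from the site, and the rule that a medal corresponds to a placement of "1", "2" or "3" is an assumption about the encoding.

```python
from dataclasses import dataclass, field

@dataclass
class Performance:
    event: str
    placement: str
    performance: str

@dataclass
class Athlete:
    name: str
    country: str
    performances: list = field(default_factory=list)

def won_at_least_two_medals(athletes):
    """Names of athletes with two or more placements among the top three."""
    medal_placements = {"1", "2", "3"}  # assumed encoding of medal positions
    return [a.name for a in athletes
            if sum(p.placement in medal_placements for p in a.performances) >= 2]

# Placeholder records, not taken from the actual site.
athletes = [
    Athlete("Athlete A", "Country X", [
        Performance("Event 1", "1", "result a"),
        Performance("Event 2", "3", "result b"),
    ]),
    Athlete("Athlete B", "Country Y", [
        Performance("Event 1", "2", "result c"),
        Performance("Event 3", "7", "result d"),
    ]),
]
medalists = won_at_least_two_medals(athletes)
```

Without the schema, the same question has to be answered against raw HTML strings, which is the situation described next.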

Note, in fact, that whenever we download a page corresponding to a database
object, the underlying structure is completely lost in favor of some HTML tag- 
ging. Before actually being able to query the site, we have somehow to re-discover 
the lost database schema, and, based on that, decode the original objects from 
the HTML strings, i.e., we face an instance of what we call the Schema Finding 
Problem. 



1.2 Related Work 

The early approaches to structuring and wrapping Web sites were essentially 
based on manual techniques fAM97IHGM(U97ICM98j : these proposals assume 
that a human programmer examines a site and manually codes wrappers that 
abstract the logical features of HTML pages and store them in a local database. 
The focus, here, is on the development of languages and tools to support this 
wrapping process. 

A number of other approaches have studied the problem of (semi-) automat- 
ically inferring the schema of a collection of fairly well structured Web pages. 
We briefly mention some in the following. Most of these approaches heavily rely 
on the use of heuristics. For example, in [AK97], the authors develop a practical
approach to identify attributes in an HTML page; the technique is based on the
identification of specific formatting tags (like the ones for headings, boldface, 
italics etc.) in order to recognize semantically relevant portions of a page. 

An alternative approach is the one developed in [KWD97] and [Ade98]. In
these cases, it is assumed that the wrapper generator has some a priori knowledge
about the semantics of a page. In [KWD97], oracles are used to this end.

the links here. In Section 4 we will discuss how to deal with attributes that are links
to other pages.
They are software modules embedding lexical knowledge about a particular application
domain, used to recognize all occurrences of a relevant attribute in a
page. In [Ade98], the user manually starts the semi-automatic structuring
phase by interactively labeling some example files, that is, by showing the
system some example attributes and their semantics. Then, based on this accumulated
knowledge, in both cases a wrapper generator tries to infer the patterns
according to which data items are organized in the target pages. This is accomplished
by examining one page at a time, and progressively refining a grammar
for the pages.

Similar problems have also been studied in other frameworks. For example, 
the work in [Bri98] aims at extracting sets of tuples of a pre-determined type
(pairs of book titles and author names) not from a set of homogeneous pages 
in a site, but from possibly heterogeneous pages in the whole Web. In jlN A IVI 
the authors address the problem of clustering similar objects in a semistructured
Lore [ACC+97] database, i.e., inferring some form of common schema starting
from schemaless data. 

More generally, the problem of identifying the common structure (in 2 di- 
mensions) of a collection of objects that are available only as one-dimensional 
sequences has gained a lot of attention recently for its applications in molecular 
biology |Wat89I.HK^ . 

1.3 Contributions of the Paper 

In this paper, we develop a formal framework for studying the Schema Finding 
Problem. Our approach departs from all the ones discussed above, in several 
respects. In fact, we aim at studying how far a completely automatic approach 
can be pushed in order to solve the schema finding problem; therefore, unlike
[AK97], our approach makes very little use of heuristics; also, we do not assume
any prior knowledge about the input data, as opposed to [KWD97, Ade98].
In fact, we are not primarily concerned with identifying the semantics of data 
in a page (i.e., the fact that a given string is the name of an athlete), but rather 
focus on the problem of recovering types. More specifically, given a set of encoded 
instances of a type T, we concentrate on the task of re-discovering the structure 
of type T. In this framework, a critical issue in recovering the schema from a set
of encoded instances is the ability to distinguish collection (i.e., set) types from
aggregation (i.e., tuple) types. Studying this relationship is a primary goal of
the paper. 

The techniques we use are also different from other proposals in the literature. 
In fact, given a collection of pages of a given logical type, instead of working on 
one page at a time, we try to infer the type by comparing different pages and 
looking at textual similarities between them. We formalize this process as a 
decoding process. 

We identify nested relations as a natural abstraction for Web pages. This 
is motivated by several observations. First, HTML pages often have a nested 
structure, due to the presence of collections of items (like lists or tables); also, 
many Web sites are now built starting from relational and object-relational 
databases, and in some sense reflect the (possibly) nested structure of the underlying
database. In fact, it has been shown in |A M M fl7IM M M that a data
model based on nested relations is a promising abstraction for describing the 
structure of a large number of actual Web sites. Of course, this choice is by no 
means the most general one, and probably cannot capture all of the richness 
present in HTML pages. However, we feel it to be a good starting point for 
studying the schema finding problem. 

To set the background, in Section 2 we formalize the notion of encoding
of database instances into sequences, and introduce various classes of encodings. 
We first consider a general version of the schema finding problem (SFP), where 
the encoding function is part of the input. In the general case, we show that the 
SFP is partial recursive if the encoding function is recursive. We then refine the 
class of encoding functions and relate the space complexity of the SFP to the 
time complexity of the encoding. 

In practice, the encoding function is not given with the encoded data. It is 
therefore desirable to be able to solve the SFP without knowledge of the encoding 
function. We further restrict the encoding functions, and introduce the mark-up 
encodings, which satisfy the previous requirement, and, at the same time, nicely 
abstract SGML-like languages rnroa - like HTML - which are used in practice 
to encode information. Essentially, the mark-up encodings map database objects 
into strings in which the constants are encoded as strings over some alphabet,
$\Delta$, and the structure is encoded by arbitrary tags in some other alphabet, $\Sigma$.
We distinguish between strict and loose mark-up, depending upon the encoding 
of empty sets. In the strict case, the structure of empty sets is maintained in the 
encoding, while it is lost in the loose case. 

Then, in Section 3, we restrict our study to data that can be represented as
nested relations and develop an algorithm that takes as input a set of strings and,
by successive alignments, progressively refines their type. Our approach relies on
a close correspondence between nested objects and regular expressions. In fact,
we show that, given a collection of encoded tuples, finding a database type that
subsumes them is reducible to finding a regular expression defining a language
containing the input strings. Note that the containment problem for regular
expressions is known to be PSPACE-complete [Pap94]. However, since we deal with
restricted classes of expressions, our algorithms run in polynomial time.

The main contribution of the paper is the definition of two on-line algorithms 
for solving the schema finding problem in the case of mark-up encodings. The 
first algorithm works on strict mark-up, and uses a buffer of size linear in the 
size of an object (e.g. a tuple of a relation). The second algorithm works in the 
loose case and uses a buffer whose size is linear in the size of the loaded data. 
Both algorithms are shown to work in polynomial time. 

Finally, in Section 4, we show how, in practice, these algorithms can be used
to load data from the Web into a database by first extracting the schema, and 
then wrapping HTML pages according to the schema. We also exhibit examples 
of HTML documents on which we illustrate the use of the proposed techniques. 
The algorithms are under implementation as part of a system for schema-finding 
on the Web. For space reasons, proofs of the theorems and code of the algorithms
are omitted. 

2 The Encoding of Data 

In this section, we describe the abstract data to which we restrict our attention, 
and introduce formal methods for encoding it into sequences of characters. We 
also define our fundamental problem. 

2.1 Abstract Data 

The abstract data we consider are nested relations [Alj95, Hul88], that is, typed
objects allowing nested sets and tuples. The types are defined as follows. We
assume the existence of an atomic type $U$, called the basic type, whose domain,
denoted by $dom(U)$, is a countable set of constants. Other types (and their respective
domains) can be recursively defined as follows: (i) if $T_1, \ldots, T_n$ are basic
or set types, then $[T_1, \ldots, T_n]$ is a tuple type, with domain $dom([T_1, \ldots, T_n]) =
\{[a_1, \ldots, a_n] \mid a_i \in dom(T_i)\}$; (ii) if $T$ is a tuple type, then $\{T\}$ is a set type,
with domain $dom(\{T\}) = \mathcal{P}_f(dom(T))$ (where $\mathcal{P}_f(S)$ denotes the collection of
finite subsets of $S$).
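The recursive definition of types and domains can be sketched in Python as follows. This is only an illustration under assumed conventions that are not part of the paper: constants are Python strings, tuple instances are Python tuples, set instances are frozensets, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Basic:
    """The basic type U."""


@dataclass(frozen=True)
class TupleType:
    """A tuple type [T1, ..., Tn]; each field is a basic or a set type."""
    fields: Tuple["Type", ...]


@dataclass(frozen=True)
class SetType:
    """A set type {T}, where T is a tuple type."""
    elem: TupleType


Type = Union[Basic, TupleType, SetType]


def is_instance(value, t) -> bool:
    """Check whether value belongs to dom(t)."""
    if isinstance(t, Basic):
        return isinstance(value, str)
    if isinstance(t, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(t.fields)
                and all(is_instance(v, f) for v, f in zip(value, t.fields)))
    if isinstance(t, SetType):
        # Finite subsets of dom(T): every element must be an instance of T.
        return (isinstance(value, frozenset)
                and all(is_instance(v, t.elem) for v in value))
    return False


# The type [U, {[U, U]}].
SIGMA1 = TupleType((Basic(), SetType(TupleType((Basic(), Basic())))))
```

For example, `("1", frozenset())` is an instance of `SIGMA1`, while a value whose inner tuples have a single field is not.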

Given a type $T$, an instance of $T$ is an element of $dom(T)$. Let $\mathbb{I}$ be the set
of all instances. A relational instance is an instance of a set type. Classical (flat)
relations are instances of unnested set types, and nested relations are instances
of arbitrary set types. Types can be conveniently represented as trees [Hul88],
which can be constructed recursively as follows: (i) the basic type, $U$, is a leaf
tree, $U$; (ii) a tuple type $[T_1, \ldots, T_n]$ corresponds to a tree rooted at a tuple node,
with $n$ subtrees, one for each $T_i$; (iii) a set type $\{T\}$ corresponds to a tree rooted
at a set node, with one subtree. Similarly, instances can also be represented as
trees, with constant instances, $c$, represented as a leaf tree $c$; tuple instances,
$[t_1, \ldots, t_n]$, represented as a tree rooted at a tuple node, with $n$ subtrees, one
for each $t_i$; and set instances $\{t_1; \ldots; t_k\}$ represented as a tree rooted at a set
instance node, with $k$ subtrees, corresponding to the elements $t_1, \ldots, t_k$. If $I$ is
an instance, any subtree of $I$ is a subinstance of $I$.

We introduce a labeling of type trees, recursively defined as follows. The
root is labeled by the empty string $\epsilon$. If a set node is labeled $\alpha$, then its child is
labeled $\alpha.0$; and if a tuple node with $n$ children is labeled $\alpha$, then its children
are labeled $\alpha.1, \ldots, \alpha.n$. Instances are labeled similarly, with the children of a
set node labeled $\alpha$ all labeled $\alpha.0$. So an instance and its type have the same
set of labels.

Instances can sometimes underuse their type. It has been shown in [GV95]
that this has a strong impact on the expressive power of query languages. In
the present context, we show that it is fundamental for finding the schema. An
instance underuses its type, for example, when set types are used to model objects
that always have the same cardinality (and would have been more accurately 
typed with a tuple), or when the attribute of a tuple always has the same value 
(and could have been omitted). We define this notion formally. 
Definition 1. A collection of instances $\mathcal{I}$ is rich with respect to a type $\sigma$ if it
satisfies the following two properties: (i) set richness: for each label $\alpha$ of a set
node, there are sets of distinct cardinalities labeled $\alpha$, and (ii) tuple richness: for
each label $\alpha$ of a tuple node of arity $k$, and each $i \le k$, there are distinct objects
labeled $\alpha.i$.

Intuitively, if a collection of instances is rich with respect to a type, it makes 
full use of this type, and we will see in the sequel that it then contains enough 
information to recover the type. 
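Definition 1 can be checked mechanically by collecting, for each label, the set cardinalities and the objects that occur in the collection. The sketch below is one possible reading of the definition, not the authors' code; the representation (types as "U", ("tuple", [...]) or ("set", T), instances as strings, tuples and frozensets) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def collect(value, label, cards, objects):
    """Record the object seen at `label` and, for sets, its cardinality."""
    objects[label].add(value)
    if isinstance(value, frozenset):
        cards[label].add(len(value))
        for elem in value:
            collect(elem, label + ".0", cards, objects)   # children of a set node
    elif isinstance(value, tuple):
        for i, field in enumerate(value, start=1):
            collect(field, f"{label}.{i}", cards, objects)  # i-th child of a tuple node


def is_rich(instances, sigma):
    """True if the collection is set-rich and tuple-rich with respect to sigma."""
    cards, objects = defaultdict(set), defaultdict(set)
    for inst in instances:
        collect(inst, "", cards, objects)   # the root carries the empty label

    def check(t, label):
        if t == "U":
            return True
        kind, body = t
        if kind == "set":
            # Set richness: at least two distinct cardinalities at this label.
            return len(cards[label]) >= 2 and check(body, label + ".0")
        # Tuple richness: at least two distinct objects under every child label.
        return all(len(objects[f"{label}.{i}"]) >= 2 and check(f, f"{label}.{i}")
                   for i, f in enumerate(body, start=1))

    return check(sigma, "")


SIGMA2 = ("tuple", ["U", ("set", ("tuple", ["U"]))])   # the type [U, {[U]}]
I_A = ("1", frozenset({("2",), ("3",)}))
I_B = ("4", frozenset({("5",)}))
I_C = ("6", frozenset({("7",)}))
```

Here the collection {I_A, I_B} is rich for [U, {[U]}], whereas {I_B, I_C} is not, because all sets have the same cardinality and could equally well be tuples.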



2.2 Encoding of Abstract Data 

We next study the encoding of instances into sequences of characters over a finite
alphabet $\Sigma$. We define an abstract encoding function, $enc$, as a 1-1 mapping from
the set of all instances $\mathbb{I}$ to $\Sigma^*$. We consider the problem of reversing an encoding,
that is, given a string in $\Sigma^*$, find the abstract data that it represents and its type.

For computation purposes, the abstract data is manipulated as strings obtained
by a simple standard encoding based on a parenthetical representation,
which is essentially the one used in the paper to "write" abstract objects. We
consider an ordered alphabet $\Delta$. The order on $\Delta$ induces a (lexicographic)
order on $\Delta^*$, which in turn induces an order on the set of all instances $\mathbb{I}$
($I < J$ if $enc(I) < enc(J)$), and in particular on $U$. The standard encoding
$\mu$ of an instance $I$ is defined recursively as follows. Constants are encoded
by words in $\Delta^*$, tuples $[a_1, \ldots, a_n]$ by the string $[\mu(a_1), \ldots, \mu(a_n)]$, and sets
$\{a_1, \ldots, a_n\}$ by the string $\{\mu(a_{i_1}), \ldots, \mu(a_{i_n})\}$, where $\mu(a_{i_1}) < \cdots < \mu(a_{i_n})$.
Let $\bar{\mathbb{I}}$ be the set of standard representations of instances; its elements are strings
over $\Delta$ extended with the bracket and separator symbols used above.
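A minimal sketch of the standard encoding follows. It assumes, for illustration only, that constants are Python strings, tuple instances are tuples, set instances are frozensets, that the comma is the separator, and that Python's built-in string order plays the role of the lexicographic order induced by the order on the data alphabet.

```python
def std(value) -> str:
    """Parenthetical standard encoding of a nested instance."""
    if isinstance(value, str):
        return value                       # a constant is written as its word
    if isinstance(value, tuple):
        return "[" + ",".join(std(v) for v in value) + "]"
    if isinstance(value, frozenset):
        # Set elements are listed in increasing order of their encodings,
        # so that equal sets always yield the same string.
        return "{" + ",".join(sorted(std(v) for v in value)) + "}"
    raise TypeError(f"not a nested instance: {value!r}")
```

Sorting the encoded elements is what makes the encoding canonical: the same set always maps to the same string, regardless of how it was built.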

Concrete encoding functions are 1-1 mappings from $\bar{\mathbb{I}}$ to $\Sigma^*$. When it is clear
from the context we simply say instance for either the abstract instance or 
its standard representation, and, similarly, encoding function for abstract or 
concrete encoding function. Let us now define formally the problem. 

Definition 2. Schema Finding Problem (SFP) 

Input: An encoding function $enc$, and a finite collection $W$ of strings of $\Sigma^*$.
Output: A type $\sigma$, and a collection $C$ of standard representations of instances
of type $\sigma$, such that $enc : C \to W$ is a bijection.

We can immediately make the following general observation. If $enc$ is a recursive
function, then the schema finding problem is partial recursive. Indeed, it
suffices to enumerate all possible instances, $I \in \bar{\mathbb{I}}$, and check if $enc(I) = w$ for
some $w \in W$. If $W$ is a collection of strings corresponding to instances, then the
enumeration halts when all such instances have been found.

We now consider the complexity of the schema finding problem in the size of
the collection $W$. The size $|w|$ of a sequence $w$ is defined as usual as the length of
$w$. The size $|I|$ of an instance $I$ is defined as the size of its standard encoding as
a sequence, $|\mu(I)|$. The domain of an instance is the set of constants occurring
in the instance, and for the standard encoding of the instance, the domain is the
set of words in $\Delta^*$ encoding the constants.

It is not sufficient to restrict the complexity of encoding functions in order 
to bound the complexity of the schema finding problem. We next propose a 
sufficient restriction on the encodings to ensure a complexity bound on the SFP. 
We say that an encoding $enc$ is non-compacting if there exists a constant $C$
such that for any instance $I$, $|I| \le C\,|enc(I)|$. We can now give the correlation
between the complexity of the encoding and the complexity of the SFP.

Theorem 1. Let $enc$ be a non-compacting encoding function having complexity
$\mathrm{TIME}(f(n))$, with $f(n) \ge n$. Then the SFP can be solved in $\mathrm{SPACE}(f(n))$.

The previous result provides sufficient conditions for bounding the complex- 
ity of the schema finding problem. These conditions do not ensure tractability 
though. Moreover, we would like to restrict further the class of encodings, to 
obtain an encoding-invariant solution to the SFP. This is a desirable property, 
since in practice, encoding functions are not given, and only the encoded data 
can be found. So our approach is to find classes of encoding functions such that 
the SFP can be decided without the knowledge of the encoding function. Note 
that this is a strong assumption, which implies that the result does not depend 
upon the chosen encoding. For that purpose we introduce a restricted class of 
non-compacting encodings, namely mark-up encodings. 

2.3 Mark-Up Encodings 

In a mark-up encoding the schema and the data are encoded in disjoint finite
alphabets, the schema alphabet, $\Sigma$, and the data alphabet, $\Delta$. As for the standard
encoding, constants are encoded as words in $\Delta^+$, by some data encoding, $\delta$, that
is a 1-1 mapping from $U$ to $\Delta^+$.

Consider a type $\sigma$, and a corresponding labeled tree $T_\sigma$. We define a tagging
function as a function that associates a pair of strings, called start and end tags,
to each label in $\mathcal{L}$, in such a way that basic, set and tuple labels, respectively,
have different start and end tags. More formally, let $\Theta$ be a prefix subset² of
$\Sigma^*$. Let $\Theta_1 \cup \Theta_2 \cup \Theta_3$ be a partition of $\Theta$. A tagging function is a mapping from
the set of labels $\mathcal{L}$ of $T_\sigma$ to $\Theta \times \Theta$, which associates to each basic (resp. tuple,
set) label $\alpha \in \mathcal{L}$ a pair of strings in $\Theta_1$ (resp. $\Theta_2$, $\Theta_3$), denoted by $start(\alpha)$ and
$end(\alpha)$, with $start(\alpha) \ne end(\alpha)$.

Definition 3. A (loose) mark-up encoding $enc$ based on a structure $(\Sigma, \Delta, <_\Delta,
\delta, tag)$, where $\Sigma$ and $\Delta$ are two disjoint finite alphabets, $<_\Delta$ is an order on
$\Delta$, $\delta$ a data encoding, and $tag$ a tagging function, is a function recursively
defined on the tree of an instance as follows:

— for a leaf node $u \in U$ with label $\alpha$, $enc(u) = start(\alpha) \cdot \delta(u) \cdot end(\alpha)$;

— for a tuple node $[T_1, \ldots, T_n]$ with label $\alpha$, $enc([T_1, \ldots, T_n]) = start(\alpha) \cdot
enc(T_1) \cdot \ldots \cdot enc(T_n) \cdot end(\alpha)$;

² Recall that a set of strings $\Theta$ is called a prefix set if no word of $\Theta$ is a prefix of
another word of $\Theta$ EE311-
— for a non-empty set-instance node $\{T_1; \ldots; T_n\}$ of label $\alpha$, $enc(\{T_1; \ldots; T_n\})
= start(\alpha) \cdot enc(T_{i_1}) \cdot \ldots \cdot enc(T_{i_n}) \cdot end(\alpha)$, where set elements have been
recursively ordered according to $<_\Delta$; empty set-instances are encoded by
$start(\alpha) \cdot end(\alpha)$.

We distinguish between loose and strict mark-up encodings, based on the
encoding of empty sets. Suppose $U$ contains a null value $null$. Given a set type
$\sigma$, we define its null instance, $null(\sigma)$, as an instance whose tree is obtained from
$\sigma$ by replacing each leaf node $U$ with a leaf node $null$. A strict mark-up encoding
is defined as a loose mark-up encoding with the following difference: the encoding
of an instance $I$ is the encoding of the instance $I'$ obtained by replacing in $I$ each
empty set-instance of a type $\sigma$ by $null(\sigma)$.
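The difference between the loose and the strict encoding can be made concrete with a short sketch. It is written under simplifying assumptions that go beyond the definition: the tagging function assigns the same pair of tags to all labels of the same kind (the tags are those of the first example in Figure 2), the data encoding is the identity, and set elements are ordered by their encoded strings. The names `TAGS`, `null_of` and `enc` are hypothetical.

```python
# One (start, end) tag pair per kind of node; together they form a prefix set.
TAGS = {"basic": ("e", "/e"), "tuple": ("<t>", "</t>"), "set": ("<s>", "</s>")}


def null_of(t):
    """Null instance of a type: every leaf U is replaced by the value NULL."""
    if t == "U":
        return "NULL"
    kind, body = t
    if kind == "tuple":
        return tuple(null_of(f) for f in body)
    return frozenset({null_of(body)})      # a set with a single null element


def enc(value, t, strict=False) -> str:
    """Mark-up encoding of `value`, an instance of type `t`."""
    if t == "U":
        start, end = TAGS["basic"]
        return start + value + end         # identity data encoding (assumption)
    kind, body = t
    start, end = TAGS[kind]
    if kind == "tuple":
        return start + "".join(enc(v, f, strict) for v, f in zip(value, body)) + end
    if strict and not value:
        value = null_of(t)                 # strict: keep the structure of empty sets
    parts = sorted(enc(v, body, strict) for v in value)
    return start + "".join(parts) + end


SIGMA1 = ("tuple", ["U", ("set", ("tuple", ["U", "U"]))])   # [U, {[U, U]}]
I1 = ("1", frozenset())                                      # [1, {}]
```

On `I1`, the loose encoding yields `<t>e1/e<s></s></t>`, while the strict encoding yields `<t>e1/e<s><t>eNULL/eeNULL/e</t></s></t>`; on instances without empty sets the two coincide.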

The definition of loose (strict) mark-up encoding can be extended also to
types. In this case, function $enc$ associates a regular expression $enc(\sigma)$ over $(\Sigma \cup
\Delta)^*$ with each type $\sigma$ as follows:

— for a leaf node $U$ labeled $\alpha$, $enc(U) = start(\alpha) \cdot (\Delta)^+ \cdot end(\alpha)$;

— a tuple node is encoded as above;

— for a set node $\{T\}$ with label $\alpha$, $enc(\{T\}) = start(\alpha) \cdot (enc(T))^* \cdot end(\alpha)$
($enc(\{T\}) = start(\alpha) \cdot (enc(T))^+ \cdot end(\alpha)$ for strict encoding).

Since the mark-up encoding of an instance is a string in $(\Sigma \cup \Delta)^*$, and the mark-up
encoding of a type is a regular expression over $(\Sigma \cup \Delta)^*$, we can establish
a close correspondence with the theory of regular languages; in fact, the class
of mark-up encodings has the following property. As usual, we denote by $L(exp)$ the
language defined by a regular expression $exp$.

Proposition 1. Given a (strict/loose) mark-up encoding $enc$ based on a structure
$(\Sigma, \Delta, <_\Delta, \delta, tag)$, then, for each instance $I$ of type $\sigma$, $enc(I) \in L(enc(\sigma))$.
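Proposition 1 can be tried out with an ordinary regular-expression engine. The sketch below builds the expression associated with a type and matches encoded instances against it using Python's `re` module. As assumptions for illustration, it reuses the kind-based tags of the first example in Figure 2 and hard-codes the data alphabet {1, 2, 3, NULL}; the names `TAGS`, `DATA` and `enc_type` are hypothetical.

```python
import re

TAGS = {"basic": ("e", "/e"), "tuple": ("<t>", "</t>"), "set": ("<s>", "</s>")}
DATA = "(?:1|2|3|NULL)+"     # non-empty words over the data alphabet


def enc_type(t, strict=False) -> str:
    """Regular expression (Python syntax) associated with type `t`."""
    if t == "U":
        start, end = TAGS["basic"]
        return re.escape(start) + DATA + re.escape(end)
    kind, body = t
    start, end = map(re.escape, TAGS[kind])
    if kind == "tuple":
        return start + "".join(enc_type(f, strict) for f in body) + end
    # Set node: Kleene star in the loose case, plus in the strict case.
    return start + "(?:" + enc_type(body, strict) + ")" + ("+" if strict else "*") + end


SIGMA1 = ("tuple", ["U", ("set", ("tuple", ["U", "U"]))])   # [U, {[U, U]}]
LOOSE = re.compile(enc_type(SIGMA1))
STRICT = re.compile(enc_type(SIGMA1, strict=True))
```

The loose expression accepts the loose encoding of [1, {}], whereas the strict expression accepts only the version in which the empty set has been replaced by its null instance.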

Figure 2 shows several types and instances with their corresponding encodings,
with $\Delta = \{1, 2, 3, NULL\}$, and $\Sigma$ containing any character not in $\Delta$. Note
that empty sets are encoded with null values in the strict mark-up encoding,
while they are simply omitted in the loose mark-up encoding.

For a collection of instances $\mathcal{J}$, we denote by $enc(\mathcal{J})$ the collection of encoded
instances of $\mathcal{J}$. The class of mark-up encodings satisfies the following fundamental
property, which states that the choice of a specific encoding is irrelevant.

Theorem 2. For all mark-up encodings $enc$, $enc'$ over the same $\Sigma$, $\Delta$, $\delta$, and
all set-rich collections of instances $\mathcal{J}, \mathcal{J}'$ of type $\sigma$, if $enc(\mathcal{J}) = enc'(\mathcal{J}')$, then
$\mathcal{J} = \mathcal{J}'$.

The proof is by induction on the nesting of the instances. Theorem 2 has
the fundamental consequence that the decoding can be done without knowledge
of the encoding function. In the next section, we investigate the SFP in this
context.
$\sigma_1 = [U, \{[U, U]\}]$
    strict mark-up encoding:  <t>eΔ+/e<s>(<t>eΔ+/eeΔ+/e</t>)+</s></t>
    loose mark-up encoding:   <t>eΔ+/e<s>(<t>eΔ+/eeΔ+/e</t>)*</s></t>

$\sigma_2 = [U, \{[U]\}]$
    strict mark-up encoding:  romaregesΔ+unl(aeΔ+iacta)+tpqr
    loose mark-up encoding:   romaregesΔ+unl(aeΔ+iacta)*tpqr

$I_1 = [1, \{\}]$ of type $\sigma_1$
    strict mark-up encoding:  <t>e1/e<s><t>eNULL/eeNULL/e</t></s></t>
    loose mark-up encoding:   <t>e1/e<s></s></t>

$I_2 = [1, \{[2]; [3]\}]$ of type $\sigma_2$
    strict mark-up encoding:  romaregesllunlae22iactaae33iactatpqr
    loose mark-up encoding:   same as strict

Fig. 2. Examples of encodings



3 Schema Finding for Mark-Up Encodings 

In this section, we introduce several algorithms to solve the SFP assuming
mark-up encodings. We first restate the problem in this special framework:

Definition 4. Schema Finding Problem for Mark-up Encodings
Input: $\Sigma, \Delta, \delta$ as defined above, and a finite collection $W$ of strings of $(\Sigma \cup \Delta)^*$.
Output: A type $\sigma$, and a collection $C$ of instances of type $\sigma$, such that there is
a mark-up encoding $enc$ such that $enc : C \to W$ is a bijection.

For simplicity, we focus in the sequel on inputs consisting of a finite collection
of encodings $enc(I_1), enc(I_2), \ldots, enc(I_n)$ of instances $I_1, I_2, \ldots, I_n$ of
a nested tuple type $\sigma$, according to some mark-up encoding function $enc$. Our
algorithms work on encoded instances, and progressively construct the type by
generating templates. Templates generalize both types and instances, and represent
partially specified types. Together with templates, we introduce a reflexive
subsumption relation between templates, denoted by $\preceq$, which extends the relationship
between a type and its instances. $T = [c, \{[U]\}]$ and
$T' = [U, \{[a, U]; [a, U]; [a, U]\}]$ are examples of templates. It is easy to see that they
correspond to partially specified types. More specifically, $T$ subsumes any tuple
of two attributes, the first one being a $c$, and the second one any set of monadic
tuples. $T'$ subsumes any tuple of two attributes, the second one being a set of
exactly three binary tuples, having $a$ as first attribute. Types and instances of
Figure 2 provide more examples of templates, satisfying the relation: type $\sigma_1$
subsumes $T'$, and $\sigma_2$ subsumes $T$. Templates and the subsumption relation can
be formally defined as follows:

— Every element $u \in U$ is a constant template; a constant template subsumes
itself.

— $U$, the basic type, is a basic template; the basic template, $U$, subsumes every
template in $dom(U)$.

— If $T_1, \ldots, T_n$ are templates, then $[T_1, \ldots, T_n]$ is a tuple template; a tuple
template, $[T_1, \ldots, T_n]$, subsumes any template in $\{[t_1, \ldots, t_n] \mid t_i \preceq T_i\}$.

— If $T$ is a template, then $\{T\}$ is a set template; a set template, $\{T\}$, subsumes
any template in $\mathcal{P}_f(\{t \mid t \preceq T\})$.

— If $T_1, \ldots, T_k$ are templates, then $T = \{T_1; \ldots; T_k\}$ is a set instance template;
$k \ge 0$ is called the cardinality of $T$; a set instance template subsumes any
template in $\{\{a_1; \ldots; a_k\} \mid a_i \preceq T_i\}$.



It is easy to see that: (i) every type is a template, constructed using only tuple, set 
and basic sub-templates; (ii) every instance of a type is itself a template, made 
of tuple, set-instance and constant sub-templates. In the following, we blur the 
distinction between a type or an instance and the corresponding template. Note
also that each template corresponding to a type subsumes all of its instances. 
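The subsumption relation can be sketched directly from the five cases above. The encoding of templates as tagged Python tuples and all function names are assumptions made for illustration. Two simplifications are labeled in the comments: elements of set instance templates are compared in the order in which they are listed, and a set template is taken to subsume a set template over a smaller element template, a case the definition does not spell out but which is needed for the relation to be transitive.

```python
U = ("U",)                                   # the basic template


def const(c):
    return ("const", c)                      # constant template


def tup(*fields):
    return ("tuple", tuple(fields))          # tuple template [T1, ..., Tn]


def set_of(t):
    return ("set", t)                        # set template {T}


def set_inst(*elems):
    return ("setinst", tuple(elems))         # set instance template {T1; ...; Tk}


def subsumes(general, specific) -> bool:
    """True if `general` subsumes `specific`."""
    if general == specific:
        return True                          # the relation is reflexive
    kind = general[0]
    if kind == "U":
        return specific[0] == "const"        # U subsumes every constant
    if kind == "tuple":
        return (specific[0] == "tuple"
                and len(specific[1]) == len(general[1])
                and all(subsumes(g, s) for g, s in zip(general[1], specific[1])))
    if kind == "set":
        if specific[0] == "setinst":
            return all(subsumes(general[1], s) for s in specific[1])
        # Assumption: {T} also subsumes {T'} whenever T subsumes T'.
        return specific[0] == "set" and subsumes(general[1], specific[1])
    if kind == "setinst":
        # Assumption: elements are matched in listed order.
        return (specific[0] == "setinst"
                and len(specific[1]) == len(general[1])
                and all(subsumes(g, s) for g, s in zip(general[1], specific[1])))
    return False                             # a constant subsumes only itself


T = tup(const("c"), set_of(tup(U)))
T_PRIME = tup(U, set_inst(*[tup(const("a"), U)] * 3))
SIGMA1 = tup(U, set_of(tup(U, U)))
SIGMA2 = tup(U, set_of(tup(U)))
```

With these definitions, `SIGMA1` subsumes `T_PRIME` and `SIGMA2` subsumes `T`, as stated in the text, while `SIGMA1` does not subsume `T`.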

We denote by $\mathcal{T}$ the universe of all templates. The relation $\preceq$ defines a partial
order on the set $\mathcal{T}$.³ We say that two templates $T_1, T_2$ are homogeneous if they
admit a common ancestor, that is, there exists a template $T \in \mathcal{T}$ such that
$T_1 \preceq T$ and $T_2 \preceq T$. Intuitively, two templates are homogeneous if they represent
objects that are subsumed by the same type. It is now easy to see that, for each
maximal set of homogeneous templates, $\mathcal{T}_h$, $(\mathcal{T}_h, \preceq)$ is a join-semilattice. Given
a template $T$, we denote by $\mathcal{H}(T)$ the class of all templates homogeneous to $T$. In
the following, unless explicitly specified, we will always refer to join-semilattices
of homogeneous templates. Given a finite collection $S$ of homogeneous templates,
$lub(S)$ will denote the least upper bound of elements in $S$ in the corresponding
join-semilattice.

Consider the relationship between a type and its instances. Since a type
subsumes all of its instances, given a set of instances $\mathcal{I} = \{I_1, \ldots, I_n\}$ of some
type $\sigma$, $\sigma$ is a common upper bound of $\mathcal{I}$ in the join-semilattice of templates
homogeneous to $\sigma$, $\mathcal{H}(\sigma)$. The following theorem shows that, if $\mathcal{I}$ makes full
use of its type according to the definition given in Section 2, then this upper
bound is the least one.

Theorem 3. Given a set of instances $\mathcal{I} = \{I_1, \ldots, I_n\}$ of type $\sigma$, $\mathcal{I}$ is rich for
$\sigma$ iff $\sigma = lub(\mathcal{I})$.
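Theorem 3 can be illustrated on abstract templates before moving to strings. The following sketch computes least upper bounds under assumptions made only for this example: templates are tagged Python tuples, elements of set instance templates are compared in listed order, and two sets of different cardinality are generalized to a set template. It is not the paper's matchSchema, which works on encoded strings.

```python
from functools import reduce

U = ("U",)


def const(c):
    return ("const", c)


def tup(*fields):
    return ("tuple", tuple(fields))


def set_of(t):
    return ("set", t)


def set_inst(*elems):
    return ("setinst", tuple(elems))


def lub(a, b):
    """Least upper bound of two homogeneous templates (sketch)."""
    if a == b:
        return a
    ka, kb = a[0], b[0]
    if ka in ("const", "U") and kb in ("const", "U"):
        return U                             # two different values generalize to U
    if ka == kb == "tuple" and len(a[1]) == len(b[1]):
        return ("tuple", tuple(lub(x, y) for x, y in zip(a[1], b[1])))
    if ka in ("set", "setinst") and kb in ("set", "setinst"):
        if ka == kb == "setinst" and len(a[1]) == len(b[1]):
            # Same cardinality: nothing yet tells a set from a tuple.
            return ("setinst", tuple(lub(x, y) for x, y in zip(a[1], b[1])))
        # Different cardinalities (or a set template): generalize to {T}.
        elems = []
        for t in (a, b):
            elems.extend(t[1] if t[0] == "setinst" else [t[1]])
        return ("set", reduce(lub, elems))
    raise ValueError("templates are not homogeneous")


def lub_all(templates):
    return reduce(lub, templates)


SIGMA2 = tup(U, set_of(tup(U)))                                   # [U, {[U]}]
I_A = tup(const("1"), set_inst(tup(const("2")), tup(const("3"))))
I_B = tup(const("4"), set_inst(tup(const("5"))))
I_C = tup(const("1"), set_inst(tup(const("6")), tup(const("7"))))
```

The rich collection {I_A, I_B} has least upper bound [U, {[U]}], while {I_A, I_C}, which is neither set-rich nor tuple-rich, only yields the partially specified template [1, {[U]; [U]}].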

³ Similar orderings for database types have been used also in other frameworks; see,
for example, Ezm. 
This result shows that solving the schema finding problem amounts to computing
the least upper bound of a set of instances according to the template
subsumption relation. However, we are not dealing with abstract objects, but 
with strings that encode them. We next show how to solve the problem using 
regular expressions. 

First, note that the tree representation of types and instances, the tree labeling
function and the notion of mark-up encoding introduced in Section 2 extend
immediately to templates. There is a close relationship between template
subsumption, $\preceq$, and the familiar concept of containment, $\subseteq$, between regular
expressions, as stated by the following theorem.

Theorem 4. Given a set-rich collection $\mathcal{I} = \{I_1, \ldots, I_n\}$ of instances of a type $\sigma$,
and a mark-up encoding $enc$, let us call $enc(\mathcal{H}(\sigma))$ the image of $\mathcal{H}(\sigma)$ according to
$enc$. Then, $(enc(\mathcal{H}(\sigma)), \subseteq)$ is a join-semilattice, and $lub_{\subseteq}(enc(I_1), \ldots, enc(I_n))
= enc(lub_{\preceq}(I_1, \ldots, I_n))$.

Another important fact is the following: 

Proposition 2. Given an encoding $enc(T)$ of a set-instance free template $T$, it
is possible to derive $T$ from $enc(T)$ in linear time.

Theorem 4 and Proposition 2 suggest that we can solve the schema finding
problem working on strings and join-semilattices of regular expressions.⁴ Given
a set of encoded instances of a nested tuple type $\sigma$, $\mathcal{E} = \{e_1, \ldots, e_n\}$, according to
some mark-up encoding, the strategy to solve the schema finding problem is: (i)
to find the regular expression $e_\sigma$ encoding $\sigma$ as the least upper bound $e_\sigma = lub_{\subseteq}(\mathcal{E})$;
(ii) from the regular expression, to construct $\sigma$; (iii) based on the grammar
defined by $e_\sigma$, from each $e_j$ to derive a representation of an instance $I_j$ of $\sigma$.

3.1 Schema Finding with Strict Mark-Up 

Let us first consider the case in which inputs are encoded using a strict mark-up
encoding function. Recall that, for every join-semilattice, the least upper
bound operator is associative, and therefore, given a set of elements, $\mathcal{E}$, we can
progressively compute the least upper bound of the set, independently of the
order $e_1, \ldots, e_k$ of the elements in $\mathcal{E}$, based on the following iterative algorithm.

$$
lub_1 = e_1, \qquad lub_{i+1} = \mathrm{LUB}(lub_i, e_{i+1}) \quad \text{for } i = 1, \ldots, k-1
$$
The equations above suggest an efficient on-line strategy to solve the schema finding
problem with mark-up encoding, assuming we know how to compute least upper 
bounds of regular expressions. Note that computing upper bounds of regular 

⁴ With respect to Theorem 4, note that the two join-semilattices, $(\mathcal{H}(\sigma), \preceq)$ and
$(enc(\mathcal{H}(\sigma)), \subseteq)$, are not in general isomorphic. This is due to the fact that the
notion of subsumption between set and set-instance templates is essentially order-independent,
whereas containment between regular expressions depends on the actual
order according to which elements in set instances are listed.






Stephane Grumbach and Giansalvatore Mecca 



expressions implies testing containment. The containment problem for regular
expressions is complete for PSPACE. However, in this context, we deal with
rather simplified regular expressions, for which we prove that the containment
problem is in PTIME. This is due to the very limited use of union (essentially
only as a part of $\Delta^+$) and to the fact that subexpressions are clearly marked
using tags.

We have developed an algorithm, called matchSchema, that computes the least 
upper bound of two expressions of this sort in polynomial time, as follows: 

Theorem 5. Given two regular expressions $e_1, e_2$ corresponding to encodings
of homogeneous templates $T_1, T_2 \in \mathcal{T}_h$, according to encoding $enc$, let $n =
\max(|e_1|, |e_2|)$. matchSchema computes $lub(e_1, e_2) \in enc(\mathcal{T}_h)$ in time

0 {n^log{n)) . 

Algorithm matchSchema tries to align the two input expressions. In presence 
of mismatches, it attempts to dynamically change the inputs, and then recur- 
sively align again. Every time an input is changed, it is transformed into a new 
regular expression, corresponding to a template that subsumes the previous one. 
The algorithm terminates when, by successive changes, the two inputs have been
transformed into a common regular expression, which is the least upper bound.

The procedure findSets is the core of the algorithm. It tries to find repeated
contiguous patterns of the form $w^n$ corresponding to elements of set instances
inside a regular expression, and replaces them by $(w)^+$. This process is far from
trivial, and involves a number of subtleties. First, the procedure is
recursive, since sets may contain nested sets. Second, not all contiguous squares
actually correspond to sets of homogeneous elements. To see this, consider the
following tuple: [a{[b{[b...]; here, the square {[b is due to a repetition of structure
at different levels, and not to instances of a set.
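The two subtleties can be illustrated with a deliberately naive sketch, which is not the authors' findSets (whose code is omitted in the paper). It assumes that the input has already been split into tokens, that the start/end tag pairs in `PAIRS` are known, and that every data word has been abstracted to the placeholder `D`; all names are hypothetical. Parsing into balanced blocks handles nesting recursively, and only runs of identical sibling blocks are collapsed, so a repetition of structure across different nesting levels is left untouched. Unlike the real procedure, this sketch would also collapse identical adjacent fields of a tuple.

```python
PAIRS = {"<t>": "</t>", "<s>": "</s>", "e": "/e"}   # assumed start -> end tags


def parse(tokens, pos=0):
    """Parse one balanced block starting at tokens[pos]; return (tree, next_pos)."""
    tok = tokens[pos]
    if tok not in PAIRS:
        return tok, pos + 1                  # data placeholder such as "D"
    children, pos = [], pos + 1
    while tokens[pos] != PAIRS[tok]:
        child, pos = parse(tokens, pos)
        children.append(child)
    return (tok, tuple(children)), pos + 1


def find_sets(tree):
    """Collapse runs w w ... w of identical sibling blocks into ("+", w)."""
    if isinstance(tree, str):
        return tree
    tag, children = tree
    done = [find_sets(c) for c in children]  # recursion: sets may contain sets
    out, i = [], 0
    while i < len(done):
        j = i
        while j + 1 < len(done) and done[j + 1] == done[i]:
            j += 1
        out.append(("+", done[i]) if j > i else done[i])
        i = j + 1
    return (tag, tuple(out))


def render(tree) -> str:
    """Write a tree back as a string, with (w)+ for collapsed runs."""
    if isinstance(tree, str):
        return tree
    if tree[0] == "+":
        return "(" + render(tree[1]) + ")+"
    tag, children = tree
    return tag + "".join(render(c) for c in children) + PAIRS[tag]


def mark_sets(tokens) -> str:
    tree, _ = parse(tokens)
    return render(find_sets(tree))
```

For instance, the tokens of `<t>eD/e<s><t>eD/e</t><t>eD/e</t></s></t>` are rewritten to `<t>eD/e<s>(<t>eD/e</t>)+</s></t>`, whereas a structure in which `<s><t>` repeats only at successive nesting levels is returned unchanged.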

Note that, if the input collection of encoded instances is not rich for
its type, the on-line algorithm computes a regular expression $e$ which corresponds
to a template subsumed by $\sigma$, that is, to a partially specified type. However, if
the input is set-rich, we can recover the schema.

Proposition 3. Let $\mathcal{E} = \{e_1, \ldots, e_n\}$ be a set of regular expressions corresponding
to instances of some type $\sigma$ according to mark-up encoding $enc$. If $\mathcal{E}$ is set-rich
with respect to $\sigma$, then $e = lub(e_1, \ldots, e_n) = enc(\sigma)$, and $\sigma$ can be built from $e$
in linear time with respect to the size of $e$.

The proof follows from the fact that, since $\mathcal{E}$ is set-rich, $e$ is set-instance
free, and therefore, by Proposition 2, $\sigma$ is uniquely identified. Proposition 3 is
very useful in a number of practical cases, as discussed in Section 4. Based on
matchSchema, we can define algorithm schemaFinding1 to solve the SFP for strict
mark-up encodings as follows: the algorithm progressively loads the encoded tuples
and compares them; it first applies matchSchema on the first two tuples and
then iteratively applies matchSchema on the previously obtained regular expression
and each new tuple. Algorithm schemaFinding1 works on-line, using a buffer
of size bounded by the size of the longest encoded tuple. Since matchSchema runs
in polynomial time, the overall computation is polynomial.
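The on-line character of schemaFinding1 amounts to a left fold of the matching function over the input stream, keeping a single expression in the buffer. The sketch below shows only this control structure. Since the code of matchSchema is not given in the paper, a toy stand-in `match_flat` is used, which is an assumption of this example: it works on flat tuples of constants and generalizes differing fields to the marker "U".

```python
def match_flat(a, b):
    """Toy stand-in for matchSchema on flat tuples (hypothetical)."""
    if len(a) != len(b):
        raise ValueError("inputs are not homogeneous")
    return tuple(x if x == y else "U" for x, y in zip(a, b))


def schema_finding1(stream, match=match_flat):
    """Fold `match` over the stream, keeping one expression at a time."""
    it = iter(stream)
    current = next(it)                 # lub_1 = e_1
    for item in it:
        current = match(current, item)  # lub_{i+1} = LUB(lub_i, e_{i+1})
    return current
```

Because the least upper bound does not depend on the order of the elements, feeding the same inputs in a different order gives the same result.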
3.2 Schema Finding with Loose Mark-Up

Consider now the schema finding problem in the case of loose mark-up encodings. 
In this case, empty sets are encoded simply by the start and end tags, so that 
their inner structure is not visible. Algorithm matchSchema fails on instances 
containing empty sets under loose encoding. 

We define the notion of empty-set free template as a template T such that all 
subtrees of T corresponding to set-instance templates have cardinality greater 
than 0. A template is said to have empty sets otherwise. The following proposition 
justifies the use of Algorithm matchSchema on loose mark-up encodings. 

Proposition 4. Given a loose mark-up encoding function, $enc$, and two regular
expressions, $e_1, e_2$, corresponding to encodings of homogeneous templates $T_1, T_2$,
call $e'$ the output of matchSchema on $e_1, e_2$; then: (i) if both $T_1, T_2$ are empty-set
free templates, then $e' = lub(T_1, T_2)$; (ii) if only one of $T_1, T_2$ is empty-set free,
then $e' = null$.

Based on Proposition 4, we can develop a polynomial on-line strategy for
solving the schema-finding problem with loose mark-up encoding, with a slightly
more demanding notion of richness, and a larger buffer. We say that a set of
instances $I$ of a type $\sigma$ is strongly rich for $\sigma$ if the subset of $I$ of empty-set free
instances is rich for $\sigma$. Intuitively, as far as sets are concerned, this corresponds
to requiring that, for each set subtree of $\sigma$, there are at least two different
non-empty cardinalities.

In this case, an on-line algorithm - let us call it schemaFinding2 - for solving
the schema-finding problem in case of loose mark-up encoding can be informally
described as follows; suppose $e_1, \ldots, e_n$ denote the input:

1. schemaFinding2 works on a data structure which is an array of templates,
Tempi, which initially contains $e_1$;

2. each $e_i$, $i = 2, \ldots, n$, is sequentially compared with the templates in Tempi
using Algorithm matchSchema; we say that $e_i$ matches a template $T$ if
matchSchema($T$, $e_i$) is not null; if a matching template is found, the match
is recorded and the next instance is examined; if no matching template is
found in Tempi, $e_i$ is added to Tempi as a new template;

3. at the end, several regular expressions will have been generated in Tempi; if the
input was strongly rich, there is at least one expression in which all sets have
been marked; a simple scan of Tempi identifies $e_\sigma$, since it is the
expression with the maximum number of marked sets (corresponding to
+-subexpressions); by replacing each + with * in $e_\sigma$, we obtain a grammar for
parsing encoded instances and building representatives.

The difference between algorithms schemaFinding1 and schemaFinding2 is that
the latter has a buffer size bounded by the size of the loaded tuples (versus the
size of a single tuple) and its running time is bounded by a polynomial in the
size of the loaded tuples.
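
The three steps of schemaFinding2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_match` is a hypothetical stand-in for matchSchema that works on tuples of set cardinalities (0 denotes an empty set whose inner structure is hidden by the loose encoding), and "recording the match" is assumed here to mean replacing the matched template with the output of the matcher.

```python
def toy_match(t, e):
    """Toy stand-in for matchSchema under loose encoding (an assumption).

    Entries are set cardinalities, 0 for an empty set, or the mark "+".
    Returns None when exactly one side has an empty set in some position.
    """
    if len(t) != len(e):
        return None
    out = []
    for a, b in zip(t, e):
        if (a == 0) != (b == 0):
            return None          # one side hides its structure: no match
        out.append(a if a == b else "+")
    return tuple(out)


def schema_finding2(instances, match=toy_match):
    templates = [instances[0]]                 # step 1: array of templates
    for e in instances[1:]:                    # step 2: compare sequentially
        for i, t in enumerate(templates):
            merged = match(t, e)
            if merged is not None:
                templates[i] = merged          # record the match (assumed)
                break
        else:
            templates.append(e)                # no match: new template
    # step 3: pick the expression with the most marked sets, turn + into *
    best = max(templates, key=lambda t: sum(1 for x in t if x == "+"))
    return tuple("*" if x == "+" else x for x in best), templates


schema, templates = schema_finding2([(2, 0), (1, 3), (3, 0), (2, 1)])
```

All templates stay in memory until the final scan, which is why the buffer grows with the loaded tuples rather than with a single tuple.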
4 Schema Finding on the Web

Frequently, Web sites have a rather tight structure; for example, a large number
of organizations now provide Web access to their information systems and
databases. Pages are generated on-the-fly by executing queries on the database
and returning results in HTML format, which clearly reflects the regularities
in the initial database. However, schema finding techniques are needed to
rediscover such regularities in files.

We claim that the techniques presented in this paper, enriched by simple 
heuristics, can effectively support schema finding on the Web. The notion of 
mark-up encoding is in fact inspired by SGML-like mark-up languages, of which 
HTML is an example. In these languages, tags - i.e., clearly identifiable pieces
of code - are used to mark data inside the document, with two main goals:
defining the logical role of a piece of information (like titles, paragraphs, lists etc.)
and contributing to the layout on the browser. Consider for example a university
department Web site which has been generated automatically starting from a
database, a case that is becoming rather frequent. Suppose, for example, the
database contains a nested relation Professor {[Name, email, ResearchGroup,
Courses: {[Code, Title]}]}, and that for each professor, i.e., for each tuple in the
relation, an HTML page like the following has been generated:

<HTML><BODY BGCOLOR="FFFFFF">
<H1><IMG SRC="deptsymbol.gif">
John Doe </H1>
<TT> doe@inf.uniroma3.it </TT>
<HR><A HREF="db.html">
Database Group </A>
<HR><UL>
<LI><B> OS134 </B>
<I> Operating Systems </I>
<LI><B> DB201 </B>
<I> Databases </I>
<LI>. . .
</UL></BODY> </HTML>

These HTML pages can be easily turned into mark-up encodings of the
original tuples. In fact, roughly speaking, tags are used in the page to encode the
original database structure, and strings to encode data. First, constants in the
database are strings, i.e., words in the Latin alphabet (let us denote it $\Sigma_{latin}$) and
are therefore encoded by themselves, that is, function S is the identity function.
Second, we may use a simple pre-processing of HTML code to distinguish tags
from other strings and replace the former with special tokens from an alphabet
$\Sigma_{tag}$. Note that this has to be done with special care, since URLs, i.e.,
references to other pages, are embedded inside tags <A> and </A>; since URLs are
to be considered as meaningful attributes of a page, the tokenization process
should extract URLs from <A> tags and treat them as strings in the data
alphabet, $\Sigma_{latin}$. A similar treatment is needed for images. In this way, pages in our
departmental site are transformed into strings over $(\Sigma_{latin} \cup \Sigma_{tag})^*$ which are
actually mark-up encodings, according to the definition given in Section 2.

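
The pre-processing step can be illustrated with a small tokenizer built on Python's standard `html.parser` module. This is a sketch under assumptions: the class name `MarkupTokenizer`, the `("tag", ...)` / `("str", ...)` token shape, and the choice to extract only `HREF` from `<A>` and `SRC` from `<IMG>` are illustrative, not part of the original method.

```python
from html.parser import HTMLParser


class MarkupTokenizer(HTMLParser):
    """Split a page into tag tokens and data strings (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Tags become tokens of the tag alphabet; other attributes are dropped.
        self.tokens.append(("tag", "<%s>" % tag.upper()))
        for name, value in attrs:
            # URLs of links and images are kept as data strings.
            if (tag, name) in (("a", "href"), ("img", "src")) and value:
                self.tokens.append(("str", value))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", "</%s>" % tag.upper()))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("str", text))


def tokenize(page):
    parser = MarkupTokenizer()
    parser.feed(page)
    parser.close()
    return parser.tokens


tokens = tokenize('<TT> doe@inf.uniroma3.it </TT>'
                  '<A HREF="db.html"> Database Group </A>')
```

The resulting token sequence is a string over the union of the two alphabets, with the URL appearing as data right after the `<A>` token.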
Now, simple heuristics can be used to identify collections of pages with ho- 
mogeneous structure inside the site; we will get to this later on in this section; 
for now, suppose that in the department site there is a page containing the list 
of all professors, with links to their respective pages, which can thus be easily 
identified. We can run our algorithms on such a collection of pre-processed pages,
and derive the following grammar, from which the nested schema is easily found: 

<HTML> <BODY BGCOLOR="FFFFFF">
<H1><IMG SRC="deptsymbol.gif"> $\Sigma_{latin}$ </H1><HR>
<TT> $\Sigma_{latin}$ </TT> <A HREF="$\Sigma_{latin}$"> $\Sigma_{latin}$ </A>
<HR><UL> (<LI><B> $\Sigma_{latin}$ </B> <I> $\Sigma_{latin}$ </I>)* </UL>
</BODY> </HTML>



After this schema has been derived, a human operator can examine the
schema and - by looking at actual pages on the site - associate a semantics
to each attribute, making it possible to tell names from e-mails. Note also that
a parser for the grammar is a natural wrapper for the pages.

It might be argued that the example above has a rather strong separation 
between data and structure in the page. Indeed, in many cases, this separation 
will be less strict. In fact, to improve readability, often the role of data inside 
a page is marked not only using tags, but also by means of attribute names, 
comments, remarks. With respect to the HTML page above, besides tags,
constant strings - like “e-mail”, “Research Group”, “List of courses” - could also be
used in the encoding. The presence of these “metadata” can somewhat complicate
the schema finding process, since pages will contain pieces of information that are
not in the original database. Interestingly, the techniques developed in Section 3
make it possible to deal correctly with these situations. In fact, our algorithms will detect
that all pages of professors contain the same constant strings, which therefore
represent metadata and can be excluded from the schema. (These strings,
common to all pages, can instead be used to automatically suggest a semantics
for attributes.)

As mentioned above, another important issue in this context consists in find- 
ing groups of homogeneous pages in a site, which can then be processed by the 
algorithms. There are several empirical ways to do this: for example, by taking 
collections of pages inside the same physical directory, or all pages linked to the 
same list. Also, several tools exist that navigate a site and produce a map of the 
site content, thus giving a graphical representation of the site topology, which 
may help in identifying homogeneous groups of pages. Let us briefly mention 
that, under the reasonable assumption that pages containing different informa- 
tion do have different features, our algorithms can be easily extended in such a 
way as to load the whole site, compare pages and cluster them in different types. 
In fact, it is easy to see that, whenever two pages of different types, containing
perhaps different tags, are compared, the algorithms return a null common
schema, thus identifying the mismatch.
5 Conclusion

The algorithms have so far been tested on only a few samples. Nevertheless, we
are rather optimistic. Indeed, the number of tuples that are necessary to recover
the schema of a nested relation seems very low. Consider random instances with
a probability $p$ that, for a given label $a$, two instances have all sets with label
$a$ of equal cardinality. Then, the probability that a collection of $n$ instances
does not have all sets with label $a$ of the same cardinality is $(1 - p^{n-1})$. Assuming that
the probabilities are independent for different labels, the probability that
a collection of $n$ instances with $k$ set labels is set-rich is $(1 - p^{n-1})^k$. So, for
example, if $p = \frac{1}{2}$, the probability that the schema of a relation with 5 sets in its
type is found after looking at 10 tuples is 99%! The previous observation leads
to a linear time algorithm for extracting the schema and loading the tuples in
the right format. Indeed, once the schema is obtained (in an amount of time
which on average depends only upon the schema size), checking that a tuple is
of the right form, and loading it can be done in linear time.
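
The estimate can be checked numerically. The sketch below assumes the probability of set-richness takes the form $(1 - p^{n-1})^k$, which is our reading of the argument above; the function name `prob_set_rich` is hypothetical.

```python
def prob_set_rich(p, n, k):
    """Probability that n instances with k set labels are set-rich,
    assuming the form (1 - p**(n-1))**k and independence across labels."""
    return (1.0 - p ** (n - 1)) ** k


# p = 1/2, 10 tuples, 5 set labels
value = prob_set_rich(0.5, 10, 5)
print(round(value, 4))
```

With $p = \frac{1}{2}$, $n = 10$ and $k = 5$ the value is about 0.99, in line with the figure quoted in the text.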

In the present paper, we have concentrated on abstract data consisting of 
nested relations. A natural and useful extension would be to allow union types 
and therefore heterogeneous sets to model semi-structured data [Abi97] which, in
this context, seems fundamental. The algorithms developed here can be adapted 
to nested heterogeneous sets with the following outcome. We were able to extract 
a type which is subsumed by the original nested type with union, but we have 
no formal results yet in this case. 

We are now implementing our algorithms as part of a system for schema- 
finding on the Web. The system downloads HTML pages from a target Web site 
and tries to infer their common structure. Although the techniques we propose
are far from being a final solution to this problem, our experience tells us that
they represent a promising starting point. In fact, in all cases in which pages
do comply with a tight structure, our algorithms have been able to correctly
recover the type. Also, we show that on average only a small number of tuples is
needed to identify the schema, thus providing a linear behavior of the method.
We are now working to extend the framework in order to handle pages with
“exceptions” [CM98], i.e., pages containing features that slightly depart from
the common type.
Abstract. The foundational homomorphism techniques introduced by
Chandra and Merlin for testing containment of conjunctive queries have 
recently attracted renewed interest due to their central role in informa- 
tion integration applications. We show that generalizations of the clas- 
sical tableau representation of conjunctive queries are useful for comput- 
ing query answers in information integration systems where information 
sources are modeled as views defined on a virtual global schema. We 
consider a general situation where sources may or may not be known to 
be correct and complete. We characterize the set of answers to a global 
query and give algorithms to compute a finite representation of this pos- 
sibly infinite set, as well as its certain and possible approximations. We 
show how to rewrite a global query in terms of the sources in two spe- 
cial cases, and show that one of these is equivalent to the Information 
Manifold rewriting of Levy et al. 



1 Introduction 

Information Integration systems [Ull97] aim to provide a uniform query interface
to multiple heterogeneous sources. One particular and useful way of viewing these
systems, first proposed within the Information Manifold project [LRO96], is to
postulate a global schema (called a world view) that provides a unifying data 
model for all the information sources. A query processor is in charge of accepting 
queries written in terms of this global schema, translating them to queries on 
the appropriate sources, and assembling the answers into a global answer. Each 
source is modeled as a materialized view defined in terms of the global relations, 
which are virtual. Note the reversal of the classical model: instead of thinking 
of views as virtual artifacts defined on stored relations, we think of the views as 
stored, and the relations that the views are defined on as virtual.^1

^1 Whether the views are actually stored at each source or materialized by other means,
and whether they consist of relations or semi-structured objects or files are issues
irrelevant to our discussion and hidden from the query processor by appropriate
wrappers.
A question of semantics now arises: what is the meaning of a query? Since
a query is expressed in terms of the global schema, and the sources implicitly
represent an instance of this global schema, it would be natural - at least
conceptually - to reconstruct the global database represented by the views and
apply the query to this global database. There are at least two issues that must
be resolved for this to work.

First, the database represented by a set of sources may not be unique; in 
fact, it may not even exist. For two trivial examples: first, suppose we have 
a single information source, which is defined as the projection on attribute A 
of the global binary relation R(A, B). For any given set of tuples stored at the 
source, there are many (perhaps infinitely many) possible global databases.
Second, suppose we have not one, but two sources, both storing the projection
on A as before; one contains the single tuple $(a_1)$, and the other one the single
tuple $(a_2)$; then there is no global database whose projections equal these two
sources. In sum, the first issue is: what database or databases are represented 
by a given set of sources? 

Second, suppose we have identified the set of databases that are represented 
by a given set of sources. Applying the query to each database and producing 
all possible answers may be impossible (e.g. if there is an infinite number of 
such answers) or undesirable. The second issue is: how do we produce a single 
compact representation of these multiple answers, or an approximation to them 
if we so desire? 

In this paper, we explore answers to both questions. Before overviewing the 
results, let us make the example above more concrete. Consider a schema for 
storing information about the first round of the World Cup Soccer Tournament. 
Suppose global relation Team(Country, Group) represents a list of all teams giv- 
ing the name of the country and the group to which the country has been assigned 
for first round play. 

Suppose first that the only source, $S_{Team}$, stores a unary relation listing all
the countries that are participating in the first round. The corresponding view
mapping is given by the conjunctive query:

$S_{Team}(x) \leftarrow Team(x, y)$.

What global databases are represented by $S_{Team}$? They are all the relations
Team such that the view mapping applied to Team produces exactly the given
instance of $S_{Team}$, that is:

$S_{Team} = \pi_{Country}(Team)$.

In this case, we say the view is both sound and complete.

On the other hand, suppose the only source is $S_{Qual}$, which contains the list of
all teams that participated in the qualifying round. This is a strict superset of the
teams that will actually be playing in the tournament. Since the global schema
says nothing about the qualifying round, the only reasonable view mapping is
still

$S_{Qual}(x) \leftarrow Team(x, y)$.

However, now we understand that this is just an approximation, and that the
actual database could be any relation whose projection on Country produces a
subset of the source, that is:

$S_{Qual} \supseteq \pi_{Country}(Team)$.

In this case, we say the view is complete (since it lists every possible team)
but not sound (since it lists teams that are not in the first round).

Finally, suppose the only source is $S_{Tube}$, listing those teams whose games
will be televised. Again, the best way to represent the view mapping (since there
is no information about television in the global schema) is by

$S_{Tube}(x) \leftarrow Team(x, y)$.

In this case, every team listed in $S_{Tube}$ corresponds to some tuple in the ideal
Team relation, but there are tuples in this ideal relation not represented in $S_{Tube}$.
Thus, we take as the set of represented databases all the relations Team that
satisfy

$S_{Tube} \subseteq \pi_{Country}(Team)$.

In this case, we say the view is sound, but not complete.

In sum, we will annotate each source, not only with the view mapping that is 
associated with it, but also with the knowledge of whether this view is guaranteed 
to be sound, or complete, or both. We will say a global database is represented 
by a given set of (annotated) sources when applying the view mappings to this 
database produces a subset of every complete source and a superset of every 
sound source. 

In the example, if we had both the complete source $S_{Qual}$ and the sound source
$S_{Tube}$, the databases represented by them would be all those that contained at
least all the countries in $S_{Tube}$ and at most those in $S_{Qual}$. If we had the sound
and complete source $S_{Team}$ (with or without the other two), then databases
would be constrained to contain exactly the teams listed in this source. Note
that it is possible for a given set of source instances to be inconsistent with the
source annotations; for example, if $S_{Team}$ mentions a team not listed in $S_{Qual}$,
or omits one mentioned in $S_{Tube}$. In such a case, the set of represented databases
is empty.
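
The notion of a database being represented by annotated sources can be made concrete for this example. The following is a minimal sketch: the function name `represented`, the encoding of a source as an `(extension, sound, complete)` triple, and the sample country data are all made up for illustration; the only view mapping considered is the projection on Country.

```python
def represented(team, sources):
    """Check whether the relation `team` is represented by the sources.

    team    : set of (country, group) facts of the global relation Team
    sources : list of (extension, sound, complete) triples, where the
              extension is a set of country names
    """
    countries = {country for country, _group in team}  # projection on Country
    for extension, sound, complete in sources:
        if sound and not extension <= countries:
            return False   # a sound source lists a team missing from Team
        if complete and not countries <= extension:
            return False   # Team contains a country the complete source omits
    return True


s_qual = ({"Brazil", "Norway", "Italy", "Malta"}, False, True)  # complete only
s_tube = ({"Brazil"}, True, False)                              # sound only

ok = represented({("Brazil", "A"), ("Norway", "A")}, [s_qual, s_tube])
```

A sound and complete source is simply a triple with both flags set, which forces the projection to equal the extension.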

The outline of the paper is as follows. In Section 2 we formalize the model 
above. In order to finitely represent infinite sets of databases, we introduce tem- 
plates, and show how to associate with each set of source instances a template 
that characterizes all databases represented by the sources. A template is a
vector consisting of tableaux [ASU79] and a set of constraints. Our constraints
are related to the egd's of Beeri and Vardi [BV81]. In Section 3 we present our
results. First we give a syntactic characterization of consistent sets of sources in 
terms of their templates. Then, we give an algorithm for evaluating a conjunctive 

^2 Actually, only relations with 32 tuples in them qualify, since there are 32 teams in
the first round, but we ignore this level of detail in our model.
query, expressed on the global schema, using only the given source instances. The
algorithm uses the template generated from the source instances and produces 
as output another template that describes all the answers, one for each repre- 
sented database. We also show how to compute two natural approximations to 
this set: the union (the approximation by possibility) and the intersection (the 
approximation by certainty) of all the answers. 

Next, we study the complexity of conjunctive query evaluation. We show that 
consistency is coNP-complete in general, but consistency and query evaluation 
become polynomial when we have only sound or only complete sources. These 
results extend and generalize earlier ones of Abiteboul and Duschka [AD98].

Finally, in Section 4 we show how to rewrite a query on the global schema 
in terms of the sources, in two cases: when all sources are sound and the certain 
semantics is of interest, and when all sources are complete and the possible 
semantics is sought. This is an optimization over building the template and 
applying the generic method, since the cost of building the template is linear 
in the size of all the source instances but there may be only a few out of many 
sources that are relevant to the query; the rewritings will only use the relevant 
sources. We show that the all-sound, certain-answer rewriting is equivalent to 
(although different from) the Information Manifold rewriting described by Levy 
et al. in [LTVT^95| without any formal justification. In fact, this was the original 
motivation for our work: providing semantics for the mysterious Information 
Manifold algorithm. 



2 The Model 

In this section we first introduce global databases. A global database is a stan- 
dard multirelational database. We then introduce database templates, a tool for 
representing sets of global databases. Finally, we define the concept of source 
collections, modeling a set of sources that we want to amalgamate. 

For basic terminology and concepts regarding databases we refer to [AHV95].



Global Databases 

Let rel be a countably infinite set $\{R, S, \ldots, R_1, R_2, \ldots, S_1, S_2, \ldots\}$ of global
relation names, let dom be a countably infinite set $\{a, b, \ldots, a_1, a_2, \ldots, b_1, b_2, \ldots\}$ of
constants, and let var be a countably infinite set $\{x, y, \ldots, x_1, x_2, \ldots, y_1, y_2, \ldots\}$
of variables.

Associated with each relation name $R$ is a positive integer $arity(R)$, which
is the arity of $R$. A fact over $R$ is an expression of the form $R(a_1, \ldots, a_k)$, where
$k = arity(R)$.

Let $\mathbf{R} = \{R_1, R_2, \ldots, R_n\}$ be a set of global relation names. A set of global
relation names will sometimes also be called a global schema. A global database
$D$ over $\mathbf{R}$ is a finite set of facts, each fact being over some $R_i \in \mathbf{R}$.
Template Databases

We now introduce templates, which are a generalization of tableaux intended to 
represent concisely a large or infinite number of possible database instances. 

Let $R$ be a global relation name. An atom over $R$ is an expression of the form
$R(e_1, \ldots, e_k)$, where $k = arity(R)$ and each $e_i$ is in dom $\cup$ var. Note that an atom
containing only constants is the same as a fact.

Let $\mathbf{R} = \{R_1, R_2, \ldots, R_n\}$ be a set of global relation names. A tableau $T$ over
$\mathbf{R}$ is a finite set of atoms over the $R_i$'s. Note that the same variable might appear
in several atoms in $T$.

A constraint over $\mathbf{R}$ is a pair $(U, \Theta)$, where $U$ is a tableau over $\mathbf{R}$ and $\Theta$ is
a finite set of substitutions of the form $\{x_1/a_1, x_2/a_2, \ldots, x_p/a_p\}$, where all the
$x_i$'s are distinct variables appearing in $U$.

Finally, a database template $\mathcal{T}$ over $\mathbf{R}$ is a tuple $(T_1, T_2, \ldots, T_m, C)$, where
each $T_i$ is a tableau over $\mathbf{R}$ and $C$ is a finite set of constraints over $\mathbf{R}$.

Example 1. Let $\mathcal{T} = (T_1, T_2, C)$, where $T_1 = \{R(a,x), S(b,c), S(b',c)\}$, $T_2 =
\{R(a',b'), S(b',c)\}$, and $C = \{(R(a,z), \{\{z/b\}, \{z/b'\}\})\}$. This template contains
two tableaux, and one constraint with two substitutions.

A valuation is a finite partial mapping from var $\cup$ dom to dom that is the
identity on dom. Valuations, like $x_i \mapsto a_i$, for $i \in [1,p]$, will usually be given
in the form of substitutions $\{x_1/a_1, \ldots, x_p/a_p\}$. The identity on constants is
omitted in this notation. Two valuations are compatible if they agree on the
variables on which both are defined.

A database template $\mathcal{T}$ on schema $\mathbf{R} = \{R_1, R_2, \ldots, R_n\}$ represents a set of
global databases on $\mathbf{R}$. This set is denoted $rep(\mathcal{T})$, and it is defined by

$rep(\mathcal{T}) = \{D$ : there is a valuation $\vartheta$ and a tableau $T_i$ in $\mathcal{T}$,
such that $\vartheta(T_i) \subseteq D$,
and for all $(U, \Theta) \in C$ in $\mathcal{T}$, and valuations $\sigma$,
if $\sigma(U) \subseteq D$ then there is a $\theta \in \Theta$ such that $\sigma$ and $\theta$ are compatible$\}$.

The definition says that a database $D$ is represented by template $\mathcal{T}$ if there
is a tableau $T_i$ in $\mathcal{T}$, such that when all variables in $T_i$ are replaced by some
constants, the set of facts thus obtained is a subset of $D$. This is the “sound” part
of $\mathcal{T}$. Furthermore, the database $D$ has to satisfy the “completeness” restrictions
encoded in the constraints $(U, \Theta)$ in $C$: whenever it is possible to embed the
tableau $U$ in $D$, the mapping used for the embedding must be extendible to one
of the substitutions in $\Theta$.

Example 2. Consider the template $\mathcal{T} = (T_1, T_2, C)$ from Example 1. The template
$\mathcal{T}$ represents the three databases $\{R(a,b), S(b,c), S(b',c)\}$, $\{R(a,b'), S(b,c),
S(b',c)\}$, $\{R(a',b'), S(b',c)\}$, and any of their supersets satisfying the constraint
saying that whenever $a$ occurs in the first component of $R$, then the second
component has to be $b$ or $b'$. For instance, $\{R(a,b), R(a,b'), S(b,c), S(b',c)\}$ is a
database in $rep(\mathcal{T})$, while $\{R(a,c), R(a,b'), S(b,c), S(b',c)\}$ is not in $rep(\mathcal{T})$.
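
Membership in the set of databases represented by a template can be tested directly from the definition. The sketch below is illustrative only: atoms are Python tuples such as `("R", "a", "?x")`, variables are strings starting with `?` (a convention of this sketch), and the names `embeddings`, `compatible` and `in_rep` are hypothetical.

```python
def is_var(term):
    # Convention of this sketch: variables are strings starting with "?".
    return term.startswith("?")


def embeddings(tableau, db):
    """Yield each valuation (dict: variable -> constant) that maps every
    atom of the tableau to a fact of db."""
    atoms = list(tableau)

    def extend(i, val):
        if i == len(atoms):
            yield dict(val)
            return
        atom = atoms[i]
        for fact in db:
            if fact[0] != atom[0] or len(fact) != len(atom):
                continue
            new = dict(val)
            if all((new.setdefault(t, c) == c) if is_var(t) else (t == c)
                   for t, c in zip(atom[1:], fact[1:])):
                yield from extend(i + 1, new)

    yield from extend(0, {})


def compatible(v1, v2):
    """Two valuations agree on the variables on which both are defined."""
    return all(v1[x] == v2[x] for x in v1.keys() & v2.keys())


def in_rep(db, tableaux, constraints):
    # "Sound" part: some tableau embeds into db.
    sound = any(next(embeddings(t, db), None) is not None for t in tableaux)
    # "Complete" part: every embedding of U agrees with some substitution.
    complete = all(any(compatible(sigma, theta) for theta in thetas)
                   for u, thetas in constraints
                   for sigma in embeddings(u, db))
    return sound and complete


T1 = [("R", "a", "?x"), ("S", "b", "c"), ("S", "b'", "c")]
T2 = [("R", "a'", "b'"), ("S", "b'", "c")]
C = [([("R", "a", "?z")], [{"?z": "b"}, {"?z": "b'"}])]

good = {("R", "a", "b"), ("R", "a", "b'"), ("S", "b", "c"), ("S", "b'", "c")}
bad = {("R", "a", "c"), ("R", "a", "b'"), ("S", "b", "c"), ("S", "b'", "c")}
```

On the two databases of Example 2, `in_rep(good, [T1, T2], C)` holds while `in_rep(bad, [T1, T2], C)` does not, because the fact R(a, c) violates the constraint.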
Note the disjunctive role of the several tableaux in a template, similar to the
flocks of Note also that the constraints can encode the pure closed 

world assumption. For instance, the closed world interpretation of a relation with
the two facts $R(a,b)$ and $R(c,d)$ can be expressed as the template $(T, C)$, with
$T = \{R(a,b), R(c,d)\}$, and $C = \{(\{R(x,y)\}, \{\{x/a, y/b\}, \{x/c, y/d\}\})\}$.

Chasing a Tableau 

We now introduce a binary relation $\mathcal{R}_{(U,\{\theta\})}$ between tableaux, where the relation
is parameterized by a constraint with a singleton binding set. The idea is that $T$
and $T'$ are related by $\mathcal{R}_{(U,\{\theta\})}$ when $U$ can be embedded into $T$ by a valuation
and then this valuation and $\theta$ are applied to $T$ to obtain $T'$. More formally,
we say that $U$ applies to $T$ through $\sigma$ when there is a valuation $\sigma$ such that
$\sigma(U) \subseteq T$. Then $T$ satisfies $(U, \{\theta\})$ wrt $\sigma$ if $U$ applies to $T$ and $\theta(U)$ and $\sigma(U)$
can be unified. Let $T$ and $T'$ be tableaux. Then $T \; \mathcal{R}_{(U,\{\theta\})} \; T'$ if and only if
$U$ applies to $T$ through $\sigma$ and, either $T$ satisfies $(U, \{\theta\})$ wrt $\sigma$, in which case
$T' = mgu(\theta(U), \sigma(U))(T)$, or $T$ does not satisfy $(U, \{\theta\})$ wrt $\sigma$, in which case
$T' = \bot$.^3

Then let $(U, \Theta)$ be a general constraint. We define $\mathcal{R}_{(U,\Theta)} = \bigcup_{\theta \in \Theta} \mathcal{R}_{(U,\{\theta\})}$.
If $C$ is a set of constraints we set $\mathcal{R}_C = \bigcup_{(U,\Theta) \in C} \mathcal{R}_{(U,\Theta)}$. Finally, let $\mathcal{R}_C^{+}$ be the
transitive closure of $\mathcal{R}_C$.

It is straightforward to show that $\mathcal{R}_C$ is well-founded, i.e. for any tableau
$T$, the set $\{T' : T \; \mathcal{R}_C^{+} \; T'\}$ is finite. This set therefore contains some maximal
elements, i.e. tableaux $T'$ such that for no tableau $T''$ does $T' \; \mathcal{R}_C \; T''$ hold.
We can now define the chase of a tableau $T$ with a set of constraints $C$, denoted
$C(T)$, to be the set of maximal elements in $\{T' : T \; \mathcal{R}_C^{+} \; T'\}$, provided the
last set is nonempty. Otherwise $C(T) = \{T\}$. In the template $(T_1, T_2, C)$ from
Example 1 we have $C(T_1) = \{\{R(a,b), S(b,c), S(b',c)\}, \{R(a,b'), S(b,c), S(b',c)\}\}$,
and $C(T_2) = \{T_2\}$. For another example, let $T = \{R(a,x), R(y,d)\}$, and let $C =
(\{R(a,u)\}, \{u/b\}, \{u/b'\}) \cup (\{R(w,d)\}, \{w/c\}, \{w/c'\})$. Then one maximal chain
in the “chasing sequence” is

$\{R(a,x), R(y,d)\} \; \mathcal{R}_C \; \{R(a,b), R(y,d)\} \; \mathcal{R}_C \; \{R(a,b), R(c,d)\}$.

Another maximal chain is

$\{R(a,x), R(y,d)\} \; \mathcal{R}_C \; \{R(a,x), R(c',d)\} \; \mathcal{R}_C \; \{R(a,b), R(c',d)\}$.

It is easy to verify that

$C(T) = \{\{R(a,b), R(c,d)\},
\{R(a,b), R(c',d)\},
\{R(a,b'), R(c,d)\},
\{R(a,b'), R(c',d)\}\}$.

In general, the chase has the following property. 



^3 mgu means most general unifier, and $\bot$ represents the inconsistent
tableau; no $U$ can be applied to $\bot$, and no valuation can be applied to $\bot$.
Theorem 1. Let $\mathcal{T} = (T_1, \ldots, T_n, C)$ be a database template, and let $\{T'_1, \ldots, T'_m\}
= \bigcup_{i \in [1,n]} C(T_i)$. Then $rep(\mathcal{T}) = rep((T'_1, \ldots, T'_m, C))$.

A fundamental question for a template database $\mathcal{T}$ is whether $rep(\mathcal{T})$ is
non-empty, that is, whether $\mathcal{T}$ is consistent. The next theorem gives a syntactic
condition.

Theorem 2. Let $\mathcal{T} = (T_1, \ldots, T_n, C)$ be a database template. Then $rep(\mathcal{T}) \neq \emptyset$
if and only if there is a tableau $T_i$ in $\mathcal{T}$ such that $\bot \notin C(T_i)$.

Source Collections 

Let loc be a countably infinite set $\{V, U, \ldots, V_1, V_2, \ldots, U_1, U_2, \ldots\}$ of local
relation names. The local relation names have arities, and atoms over local relation
names are defined in the same way as atoms over global relation names.

A view definition $\varphi$ is a conjunctive query

$head(\varphi) \leftarrow body(\varphi)$,

where $body(\varphi)$ is a sequence $b_1, b_2, \ldots, b_n$, where each $b_i$ is an atom over a global
relation name, and $head(\varphi)$ is an atom over a local relation name $V$. We also
assume that all variables occurring in $head(\varphi)$ also occur in $body(\varphi)$, i.e. that
the query $\varphi$ is safe. The conjunctive query $\varphi$ can be applied to a database $D$ in
the usual way, resulting in a set of facts $\varphi(D)$ over $V$.

A view extension for a view $\varphi$ is a finite set of facts over the relation name
$V$ in the head of the view definition $\varphi$. Such an extension will be denoted $v$.

A source $S$ is a triple consisting of a view definition $\varphi$, a set containing one
or both of the labels open and closed, and a view extension $v$ for $\varphi$. Sources
are called open, closed, or clopen (closed and open) according to their labels.
Sometimes open sources will be called sound, and closed sources called complete.
Clopen sources will be called sound and complete.

A source collection $\mathcal{S}$ is a finite set of sources. The (global) schema of $\mathcal{S}$,
denoted $sch(\mathcal{S})$, is the set consisting of all the global relation names occurring in the
bodies of the defining views of the sources in $\mathcal{S}$. The description of $\mathcal{S}$, denoted
$desc(\mathcal{S})$, is obtained from $\mathcal{S}$ by dropping the extension from every triple in $\mathcal{S}$. The
extension of a source collection $\mathcal{S}$, denoted $ext(\mathcal{S})$, is the union of all view
extensions in it. In other words, a source collection $\mathcal{S}$ has two “schemas,” $sch(\mathcal{S})$, which
is the “global world view,” and $desc(\mathcal{S})$, which describes the sources'
soundness/completeness and defining views. Source collections $\mathcal{S}$ have one extension,
$ext(\mathcal{S})$, which for each source $S_i$ is a set of facts over the defining view-head
relation name.

A source collection S defines a set of possible databases, denoted poss{S), as 
follows: 



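$poss(\mathcal{S}) = \{D$ over $sch(\mathcal{S})$ :
$v_i \subseteq \varphi_i(D)$ for all open sources $S_i \in \mathcal{S}$,
$v_j \supseteq \varphi_j(D)$ for all closed sources $S_j \in \mathcal{S}$,
$v_k = \varphi_k(D)$ for all clopen sources $S_k \in \mathcal{S}\}$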
Note that there are source collections, such as $\mathcal{S}$ containing the two sources
$(V_1(x) \leftarrow R(x,y), \{open\}, \{V_1(a)\})$ and $(V_2(u) \leftarrow R(u,w), \{closed\}, \{V_2(b)\})$,
for which $poss(\mathcal{S})$ is empty. This is because the extension of $V_1$ requires that
there be a tuple in every possible database with an $a$ in its first column, while
the extension of $V_2$ requires that every possible database must have only $b$'s
in its first column. Such source collections are called inconsistent. If $poss(\mathcal{S})$ is
nonempty the source collection is said to be consistent. We will see in Theorem
4 below that consistency of source collections can be characterized syntactically
through database templates and the chase.
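
The definition of the possible databases of a source collection can be checked on a given database by evaluating each view and comparing it with the stored extension. The sketch below uses conventions that are assumptions of this illustration: atoms are tuples, variables are strings starting with `?`, a source is a `(head, body, labels, extension)` tuple, and the function names are hypothetical.

```python
def is_var(term):
    # Convention of this sketch: variables are strings starting with "?".
    return term.startswith("?")


def answers(head, body, db):
    """Apply the conjunctive query head <- body to db (a set of facts)."""
    results = set()

    def extend(i, val):
        if i == len(body):
            results.add(tuple([head[0]] + [val.get(t, t) for t in head[1:]]))
            return
        atom = body[i]
        for fact in db:
            if fact[0] != atom[0] or len(fact) != len(atom):
                continue
            new = dict(val)
            if all((new.setdefault(t, c) == c) if is_var(t) else (t == c)
                   for t, c in zip(atom[1:], fact[1:])):
                extend(i + 1, new)

    extend(0, {})
    return results


def in_poss(db, sources):
    for head, body, labels, extension in sources:
        view = answers(head, body, db)
        if "open" in labels and not set(extension) <= view:
            return False   # a sound source has a fact the view does not yield
        if "closed" in labels and not view <= set(extension):
            return False   # the view yields a fact the complete source lacks
    return True


s1 = (("V1", "?x"), [("R", "?x", "?y")], {"open"}, {("V1", "a")})
s2 = (("V2", "?u"), [("R", "?u", "?w")], {"closed"}, {("V2", "b")})
```

For the two sources above, no database passes both checks: any database satisfying the first must contain an R-fact starting with a, which the second forbids. The function only tests one candidate database at a time; it does not enumerate the (possibly infinite) set of possible databases.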

3 Querying Source Collections 

In this section we first show how templates can be used to query source col- 
lections, by extending conjunctive queries to operate over templates. We then 
analyze the computational complexity of querying source collections. The com- 
plexity turns out to be coNP-complete. 



Query Evaluation 

Let $\mathcal{S}$ be a source collection, and $Q$ a conjunctive query, such that the atoms in
the body of $Q$ are atoms over relation names in $sch(\mathcal{S})$, and the head of $Q$ is an
atom over a relation name in the set $\{ans, ans_1, ans_2, \ldots\}$. Now $Q$ applied to
$\mathcal{S}$ defines the following set of possible answers:

$Q(\mathcal{S}) = \{Q(D) : D \in poss(\mathcal{S})\}$.

This (exact) answer can be approximated from below as

$Q_*(\mathcal{S}) = \bigcap_{D \in poss(\mathcal{S})} Q(D)$,

and from above as

$Q^*(\mathcal{S}) = \bigcup_{D \in poss(\mathcal{S})} Q(D)$.
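
The two approximations can be illustrated on a finite family of databases. This is only a sketch: the set of possible databases may be infinite, so the explicit list passed in here is an assumption made for illustration, as are the function name `certain_and_possible`, the sample facts, and the representation of a query as a Python function returning a set.

```python
def certain_and_possible(query, databases):
    """Return (intersection, union) of the answers over the given databases.

    databases must be a non-empty finite list; query maps a database
    (a set of facts) to a set of answers.
    """
    results = [query(db) for db in databases]
    certain = set.intersection(*results)   # approximation from below
    possible = set.union(*results)         # approximation from above
    return certain, possible


dbs = [
    {("Team", "Brazil", "A")},
    {("Team", "Brazil", "A"), ("Team", "Norway", "A")},
]
countries = lambda db: {country for (_rel, country, _group) in db}
certain, possible = certain_and_possible(countries, dbs)
```

Here the certain answer contains only the country present in every database, while the possible answer contains every country present in at least one.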

The problem begging for a solution is how to compute the answers to queries
on source collections. Towards this end we shall first construct a database
template over the global relation names. This database template will represent the
set of possible databases for $\mathcal{S}$. Let $\mathbf{R}$ be a global schema. We define a
function $(T, C)$ from sources with defining view $\varphi$, where the body of $\varphi$ consists of
atoms over relation names in $\mathbf{R}$, to template databases over $\mathbf{R}$. Given a source
$S_i = (\varphi_i, labels, v_i)$, we set $T(S_i) = \{t : t$ in $body(\varphi_i)\theta$ and $head(\varphi_i)\theta =
u$, for some $u \in v_i$ and assignment $\theta\}$, provided $open \in labels$. Otherwise $T(S_i) =
\emptyset$. If $closed \in labels$ we set $C(S_i) = (U, \Theta)$, where $U$ consists of all atoms in the
body of $\varphi_i$, and $\Theta = \{\theta : head(\varphi_i)\theta = u$, for some $u \in v_i\}$. Otherwise
$C(S_i) = \emptyset$. Finally, we set

$\mathcal{T}(\mathcal{S}) = \left( \bigcup_{S_i \in \mathcal{S}} T(S_i), \; \bigcup_{S_i \in \mathcal{S}} C(S_i) \right)$.

Example 3. Consider $S = \{S_1, S_2, S_3\}$, where $S_1 = (V_1(u,w) \leftarrow S(u,w), \{open\}, \{V_1(b,c), V_1(b',c)\})$,
$S_2 = (V_2(v) \leftarrow R(v,x), \{open\}, \{V_2(a)\})$, and
$S_3 = (V_3(z) \leftarrow R(a,z), \{closed\}, \{V_3(b), V_3(b')\})$. Then $\mathcal{T}(S) = (T_1, C)$, where
$T_1 = \{R(a,x), S(b,c), S(b',c)\}$, and $C = (\{R(a,z)\}, \{\{z/b\}, \{z/b'\}\})$.
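The construction of the tableau and constraint for a single source can be sketched as follows. This is only an illustration under our own encoding assumptions: an atom is a pair `(relation, args)`, variables are strings starting with `?`, and body-only variables are renamed apart per extension tuple. The helper names (`is_var`, `match_head`, `apply`, `source_template`) are hypothetical.

```python
def is_var(term):
    """Variables are encoded as strings starting with '?' (our convention)."""
    return isinstance(term, str) and term.startswith("?")


def match_head(head_args, tup):
    """Return the assignment theta with head*theta == tup, or None if none exists."""
    theta = {}
    for arg, value in zip(head_args, tup):
        if is_var(arg):
            if theta.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return theta


def apply(theta, atom):
    """Apply a substitution to an atom (relation, args)."""
    rel, args = atom
    return (rel, tuple(theta.get(a, a) for a in args))


def source_template(head_args, body, labels, extension, tag):
    """Return (tableau, constraint) for one source, following the definition above."""
    thetas = [t for t in (match_head(head_args, tup) for tup in extension)
              if t is not None]
    tableau = set()
    if "open" in labels:
        for k, theta in enumerate(thetas):
            # Rename body-only variables apart for each tuple (an assumption).
            fresh = {a: f"{a}_{tag}_{k}" for _, args in body for a in args
                     if is_var(a) and a not in theta}
            tableau |= {apply({**fresh, **theta}, atom) for atom in body}
    constraint = (list(body), thetas) if "closed" in labels else None
    return tableau, constraint
```

Applied to the three sources of Example 3, the open sources contribute the atoms of the tableau and the closed source contributes the constraint, up to the renaming of the body variable of the second source.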

Note that $T_1$ and C above are from the earlier example. The $\mathcal{T}$-function will always
construct templates with a single tableau. Templates with multiple tableaux 
are produced by the chase process or by query evaluation. Query evaluation is 
explained below. 

The database templates constructed by the function T have the following 
desirable property. 

Theorem 3. $rep(\mathcal{T}(S)) = poss(S)$.

We will now return to the issue of consistency of a source collection. Using
the characterization in Theorem 2 together with Theorem 3 we can prove the
following.

Theorem 4. Let S be a source collection and $\mathcal{T}(S) = (T, C)$. Then S is
consistent if and only if E ^ C’(T).

Corollary 1. If no source in a collection S is closed, then S is consistent. Like- 
wise, if no source in a collection S is open, then S is consistent. 

For an example of an inconsistent source collection, consider the collection
S from Section 2, where S contains the two sources $(V_1(x) \leftarrow R(x,y), \{open\}, \{V_1(a)\})$
and $(V_2(u) \leftarrow R(u,w), \{closed\}, \{V_2(b)\})$. We will now have $\mathcal{T}(S) = (T, C)$,
where $T = \{R(a,y)\}$, and $C = (\{R(u,w)\}, \{\{u/b\}\})$. It is easy to see
that C{T) = T, and that poss{S) is empty. 

We then return to the issue of evaluating queries over global relations given 
only the local relations in a source collection S. Since we are able to construct 
a database template $\mathcal{T}$ representing all databases in $poss(S)$ it is natural
to extend the standard query evaluation mechanism to operate on database 
templates. 

Let $\mathcal{T} = (T_1, \ldots, T_n, C)$ be a database template over R. Given a conjunctive
query $Q = ans(x) \leftarrow R_1(x_1), \ldots, R_m(x_m)$ where each $R_i$ is a relation name in
R, our evaluation proceeds as follows. Let $\{T'_1, \ldots, T'_s\} = \bigcup_{i \in [1,n]} C(T_i)$. Then
define $Q(\mathcal{T}) = (T''_1, \ldots, T''_q, \emptyset)$, where each $T''_j$, $j \in [1,q]$, is equal to a $Q_\kappa(T'_i)$,
for some $i \in [1,s]$ and substitution $\kappa$. Here

$$Q_\kappa(T'_i) = \{ans(x)\sigma : \sigma \text{ is an mgu for } \{R_j(x_j)\}_{j=1}^{m} \text{ and a subset of } T'_i, \text{ and } \sigma \text{ is compatible with } \kappa\}.$$

Note that there can be only finitely many distinct $Q_\kappa$ (modulo renaming). In
computing the unifiers used in $Q_\kappa(T')$ we always try to substitute variables or
constants from $T'$ for variables in the query. For the inconsistent tableau $\bot$, we
set $Q_\kappa(\bot) = \bot$.

As an illustration, suppose our query Q is $ans(z,u,w) \leftarrow R(z,u), S(u,w)$.
If our database template $\mathcal{T}$ is $(T_1, T_2, C)$ from the earlier example, then the query
evaluation proceeds as follows: First we chase the tableaux in $\mathcal{T}$ with C, obtaining
the three tableaux $\{R(a,b), S(b,c), S(b',c)\}$, $\{R(a,b'), S(b,c), S(b',c)\}$, and
$\{R(a',b'), S(b',c)\}$. Then we apply $Q_{\{\}}$ to each of these tableaux, getting the results
$\{ans(a,b,c)\}$, $\{ans(a,b',c)\}$, and $\{ans(a',b',c)\}$. In this example we only
need to use one substitution $\kappa$, namely the identity substitution $\{\}$. For another
example, let $\mathcal{T} = (T, C)$, where $T = \{R(a,x), R(a',x), S(b,c), S(y,c')\}$, and
$C = \emptyset$. For our example query Q, we get $Q_{\{x/b\}}(T) = \{ans(a,b,c), ans(a',b,c)\}$,
$Q_{\{x/y\}}(T) = \{ans(a,y,c'), ans(a',y,c')\}$, and $Q_{\{\}}(T) = \emptyset$. These are all the
answer tableaux; no other substitutions $\kappa$ will yield more.

The query evaluation uses the constraints in that the constraints are chased
into the tableaux before querying. The template resulting from the query evaluation
will have the empty constraint. It is clear that we thus lose some information.
However, the query result will be coinitially equivalent to the desired
one. Two sets of databases are coinitial if they have the same set of $\subseteq$-minimal
elements (see [IL84]). The coinitiality relation will be denoted $\approx$. Our extended
semantics now has the following correctness property:

Theorem 5. $rep(Q(\mathcal{T})) \approx Q(rep(\mathcal{T}))$,

where the query on the right-hand side is evaluated pointwise. 

Let S be a source collection and Q be a query on $sch(S)$. Together Theorems
3 and 5 give the following method for evaluating Q on S.

Theorem 6. 1. $Q(S) \approx Q(\mathcal{T}(S))$.

2. $Q_*(S) = \bigcap rep(Q(\mathcal{T}(S)))$.

3. $Q^*(S) \approx \bigcup rep(Q(\mathcal{T}(S)))$.



The Computational Complexity of Querying Sources 

In general, template databases have the following complexity. 

Theorem 7. Given a database template $\mathcal{T}$, testing whether $rep(\mathcal{T}) \neq \emptyset$ is
coNP-complete.

For the lower bound we use a reduction from the non-containment of tableau
queries, which is known to be a coNP-complete problem [CM77, ASU79]. Let
$U_1$ and $U_2$ be tableau queries. Without loss of generality we can assume that
the summary rows of $U_1$ and $U_2$ are the same, say $(x_1, \ldots, x_p)$. Our task is to
construct a template $\mathcal{T}$, such that $rep(\mathcal{T}) \neq \emptyset$ if and only if $U_1$ is not contained
in $U_2$. Let therefore $\mathcal{T} = (T, (U, \{\theta\}))$, where T equals the rows of $U_1$, except






that each summary variable $x_i$ has been substituted with a distinct constant $a_i$.
Let U be the rows of $U_2$, and $\theta = \{x_1/b_1, \ldots, x_p/b_p\}$, where $a_i \neq b_i$. Now it is
easy to see that $\mathcal{T}$ is inconsistent if and only if there is a containment mapping
from $U_2$ to $U_1$.

As outlined in Theorem 6 we have a general method that given a source
collection S evaluates the answer (or the exact or possible answer) on the template
$\mathcal{T}(S)$. On the other hand, Theorem 7 says that a template database might
have an intricate combinatorial structure. In the template resulting from query
evaluation we have chased out this combinatorial structure. It is clear that the
resulting template can be intractably large. The following theorem says that
there probably is no way to overcome this intractability.

Theorem 8. 1. If S contains both sound and complete sources the problem of
testing whether $poss(S)$ is nonempty is coNP-complete.

2. If S contains both sound and complete sources the problem of testing whether
a given tuple t is in $Q_*(S)$ is coNP-complete.

3. If S contains both sound and complete sources the problem of testing whether
a given tuple t is in $Q^*(S)$ is coNP-complete.

As an aside we note that Theorem 5.1 in mm falls out as a special case 
of Theorem El 

If S contains only sound or only complete sources query evaluation becomes 
tractable. 

Theorem 9. If S contains only sound or only complete sources the consistency
of $poss(S)$ as well as $Q_*(S)$ and $Q^*(S)$ can be tested/computed in time polynomial
in the number of tuples in the extension of S. (In the case where $Q^*(S)$ is
infinite we only report this fact; we don’t output all tuples.)

4 Query Optimization through Rewriting 

In practice one would be most interested in computing $Q_*(S)$ or $Q^*(S)$. It is clear
that computing $Q_*(S)$ or $Q^*(S)$ from $Q(\mathcal{T}(S))$ might involve a lot of redundant
work, since computing $\mathcal{T}(S)$ amounts to constructing the tableaux and constraints
corresponding to all view-extensions, whereas the global relations mentioned
in the body of Q might occur in only a few view definitions. Furthermore,
the query might have selections and joins that could be computed directly at the
sources. Take for example the open view definitions $\varphi_1 =_{df} V_1(x,y) \leftarrow R(x,y)$,
$\varphi_2 =_{df} V_2(u,v) \leftarrow S(u,v)$, and the query $Q =_{df} ans(a,z) \leftarrow R(a,w), S(w,z)$.
Obviously it is more efficient to “push the selection and join,” and evaluate the
query $ans(a,z) \leftarrow V_1(a,w), V_2(w,z)$ on the view extensions $v_1$ and $v_2$, rather
than obtaining $Q_*(S)$ from Q computed on the tableau constructed from $v_1$ and
$v_2$ and possibly a large number of other sources.

We shall thus consider the following problem: 

Let $\Phi = \{(\varphi_1, \{\cdot\}), \ldots, (\varphi_n, \{\cdot\})\}$ be a set of source descriptions where the
view definitions have heads $V_1, \ldots, V_n$, and let $\{R_1, \ldots, R_m\}$ be the relation
names occurring in the bodies of the $\varphi_i$'s. Given a conjunctive query Q where
$body(Q)$ contains only global relation names in $\{R_1, \ldots, R_m\}$, find a query $rew(Q)$,
where the body of $rew(Q)$ only mentions view relation names in $\{V_1, \ldots, V_n\}$,
and $head(rew(Q))$ is compatible with $head(Q)$, such that the following condition
holds: For all source collections S with $desc(S) = \Phi$ and $ext(S) = v$ it is
true that

1. (Certain answer): $rew(Q)(v) = Q_*(S)$, or

2. (Possible answer): $rew(Q)(v) = Q^*(S)$.

The queries $rew(Q)$ above are called rewritings of Q with respect to $\Phi$. If the
rewriting satisfies condition 1 it is called a certainty rewriting, and if it satisfies
condition 2 it is called a possibility rewriting.



Certainty Rewritings 

Given a query Q on $\{R_1, \ldots, R_m\}$ we shall now construct the desired rewriting
on $\{V_1, \ldots, V_n\}$ for the certain answer. It is easy to show that if all views are
closed, then the certain answer is always empty. We will consider the case where
all views are open.

The rewriting for the certain answer can be obtained according to the following
recipe: Let L be a subset of the atoms in the bodies of the view definitions
$\varphi_1, \ldots, \varphi_n$, such that L unifies with the set of atoms in $body(Q)$. Let the mgu
that achieves this unification be $\theta$. If $\theta$ does not equate any view variable x to
any other view variable or constant, unless x appears in the head of its view,
then do the following. Choose a subset $\{\varphi_{i_1}, \ldots, \varphi_{i_k}\}$ of the view queries such
that every atom in L occurs in some $body(\varphi_{i_j})$, every $body(\varphi_{i_j})$ contains at least
one atom in L, and such that the query

$$Q' = head(Q)\theta \leftarrow head(\varphi_{i_1})\theta, \ldots, head(\varphi_{i_k})\theta,$$

is safe (i.e. all variables in the head occur in the body). Consider now the queries
$Q'$ and

$$Q'' = head(Q)\theta \leftarrow body(\varphi_{i_1})\theta, \ldots, body(\varphi_{i_k})\theta.$$

It is clear that if $Q'$ exists then the query $Q''$ is the composition of the query
$Q'$ with the view definitions $\varphi_{i_1}, \ldots, \varphi_{i_k}$. Let $rew_1(Q)$ be the union of all
possible $Q'$'s thus constructed. There is a finite number (modulo renaming) of such
$Q'$'s, since each one is based on a subset L of the atoms in the bodies of the view
queries, and there is a unique (up to renaming) mgu for each pair of atom sets.
Furthermore, there is a finite number of ways of choosing a “covering” subset of
the view queries given L.

The query $rew_1(Q)$ is the rewriting of Q achieving the desired result. Formally
we have:

Theorem 10. Let S be a collection of open sources where $ext(S) = v$, and
let Q be a query on $sch(S)$. Then

$$rew_1(Q)(v) = Q_*(S).$$

For a simple example, suppose the global database schema has one relation
Flights(Source, Destination, Airline). We have two open sources: the flights by
Air Canada, and the ones by Canadian Airlines. The definitions of these views
are $AC(x,y) \leftarrow Flights(x,y,ac)$, and $CP(x',y') \leftarrow Flights(x',y',cp)$. Suppose
we would like to find all connections from Toronto to Jerusalem with one change
of planes. The query Q is

$$ans(toronto, u, v, u, jerusalem, w) \leftarrow Flights(toronto, u, v), Flights(u, jerusalem, w).$$

Now $rew_1(Q)$ will be the union of four queries, of which for instance

$$ans(toronto, u, ac, u, jerusalem, cp) \leftarrow AC(toronto, u), CP(u, jerusalem),$$

is obtained through the mgu $\{x/toronto, y/u, v/ac, y'/jerusalem, w/cp\}$.

The union of the four queries gives the certain answer: indeed, the only connections
we can for sure assert are those by Air Canada - Air Canada, Air
Canada - Canadian Airlines, Canadian Airlines - Air Canada, and Canadian
Airlines - Canadian Airlines.
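The union of the four rewritten queries can be evaluated directly on the two view extensions. The sketch below is only an illustration: it assumes each extension is a set of (source, destination) pairs, and the function name `rew1_flights` and the sample city names are ours.

```python
from itertools import product


def rew1_flights(ac, cp):
    """Evaluate the union of the four rewritings on the AC and CP extensions."""
    views = {"ac": ac, "cp": cp}
    answers = set()
    # One conjunctive query per ordered pair of airlines (four in total).
    for (airline1, leg1), (airline2, leg2) in product(views.items(), repeat=2):
        for (src, stop) in leg1:
            for (stop2, dst) in leg2:
                if src == "toronto" and stop == stop2 and dst == "jerusalem":
                    answers.add(("toronto", stop, airline1, stop, "jerusalem", airline2))
    return answers


# Hypothetical view extensions.
ac = {("toronto", "london"), ("london", "jerusalem")}
cp = {("toronto", "paris"), ("london", "jerusalem")}
```

With these toy extensions only the connections through London are returned, once with Air Canada on both legs and once with Canadian Airlines on the second leg.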

We have seen that $rew_1(Q)$ is a “sound and complete” rewriting of a query Q
with respect to the certain answer on a collection of open views. It turns out that
$rew_1(Q)$ is equivalent to a rewriting technique used in the Information Manifold
project [LMSS95]. Formally, let $manifold(Q)$ be the union of all conjunctive
queries P with no more atoms in their bodies than Q and no constants other
than those appearing in Q or the view definitions, such that the expansion $P'$
of P through the view definitions is contained in Q. Then we have:

Theorem 11. $rew_1(Q) = manifold(Q)$.

For algorithmic purposes note the following: Suppose that there are n sources, 
and that the size of the body of view-definition ipi is bi, with B = max {hi : i € 
[l,n]}, b = and the global query has a body consisting of m atoms. A 

straightforward implementation of the Information Manifold rewriting technique 
runs in time x n"*), whereas a straightforward implementation of our 

technique based on unification requires time $O(b^m)$. We see that our unification-based
rewriting technique is polynomial in the source descriptions and the
number of sources and exponential in the size of the global query. The Information
Manifold rewriting technique is polynomial in the number of sources and
exponential both in the size of the source descriptions and the global query.
However, the following result, indicating that there is little chance of getting rid
of the global query size in the exponent, follows easily from [LMSS95].

In the Flights example, toronto, jerusalem, ac, and cp are constants.

Theorem 12. Let $\Phi = \{(\varphi_1, \{open\}), \ldots, (\varphi_n, \{open\})\}$ be a set of source
descriptions. Given a conjunctive query Q using only the relation names in the
bodies of the $\varphi_i$'s, the problem of whether there exists a certainty rewriting of Q
with respect to $\Phi$ is NP-complete.

Possibility Rewritings 

For the possible answer we note that it is straightforward to show that if all
sources in S are open, then $Q^*(S)$ is always infinite. Here we treat the case
where all views are closed.

For this consider each $body(\varphi_i)$ of the view definitions $\varphi_i$. For each containment
mapping $h_{i_j}$ from $body(\varphi_i)$ to $body(Q)$ set $Q'_{i_j}$ to $head(Q) \leftarrow h_{i_j}(head(\varphi_i))$.
If there are no containment mappings $h_{i_j}$, then no $Q'_{i_j}$ exists. Suppose wlog that
we found containment mappings for $i \in [1,k]$, $k \leq n$, and for each such i we
found $m_i$ different containment mappings. We then set $rew_2(Q) =$

$$head(Q) \leftarrow h_{1_1}(head(\varphi_1)), \ldots, h_{1_{m_1}}(head(\varphi_1)), \; \ldots, \; h_{k_1}(head(\varphi_k)), \ldots, h_{k_{m_k}}(head(\varphi_k))$$

provided this query is safe. If it isn’t safe or if there are no $Q'_{i_j}$'s, then $rew_2(Q)$
is undefined.

We can now show that the rewriting has the desired property. 

Theorem 13. Let S be a collection of closed sources where $ext(S) = v$, and
let Q be a query on $sch(S)$. Then

$$rew_2(Q)(v) = Q^*(S).$$

For an example, let us return to the World Cup Soccer Tournament from
the introduction. Suppose we only have the closed source $S_{Qual}$ available. Recall
that this source was defined by

$$S_{Qual}(x) \leftarrow Team(x,y).$$

Suppose we would like to have a list of all possible matches in the first round.
Our query Q is then

$$ans(u,v) \leftarrow Team(u,w), Team(v,w).$$

We are now able to compute the possible answer $Q^*$ by evaluating

$$rew_2(Q) = ans(u,v) \leftarrow S_{Qual}(u), S_{Qual}(v)$$

on the closed source. The rewriting was obtained as $ans(u,v) \leftarrow h_1(S_{Qual}(x)), h_2(S_{Qual}(x))$,
where $h_1 = \{x/u, y/w\}$ and $h_2 = \{x/v, y/w\}$. Clearly, the pair of
teams actually playing against each other in the first round will be found only
among the elements of the Cartesian product $S_{Qual} \times S_{Qual}$.
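Evaluating this possibility rewriting amounts to a Cartesian product of the closed source with itself. A minimal sketch, assuming the extension is a set of one-element tuples; the function name `rew2_first_round` and the sample team names are illustrative only:

```python
from itertools import product


def rew2_first_round(s_qual):
    """Evaluate ans(u, v) <- SQual(u), SQual(v) on the closed source extension."""
    return {(u, v) for (u,), (v,) in product(s_qual, repeat=2)}


# Hypothetical extension of the closed source.
s_qual = {("brazil",), ("norway",)}
```

Every ordered pair of qualified teams is a possible first-round match, including a team paired with itself, because the rewriting places no further restriction on u and v.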

As in Theorem 12 it turns out that the rewriting is intractable in the global
query size.

Theorem 14. Let $\Phi = \{(\varphi_1, \{closed\}), \ldots, (\varphi_n, \{closed\})\}$ be a set of source
descriptions. Given a conjunctive query Q using only the relation names in the
bodies of the $\varphi_i$'s, the problem of whether there exists a possibility rewriting of
Q with respect to $\Phi$ is NP-complete.



5 Related Work 



Many works deal with finite representations of infinite sets of databases, both
in the constraint database literature [KKR90, vdM93] and in the literature on
uncertain and indefinite databases [AKG91, Gra91, IL84, Men84]. We will not go



into details; for surveys of these fields see and |vdM98| . 

The work of Motro [Mot97] is close in spirit and motivation to ours; however,
Motro assumes the existence of a “real world” global database instead of
modeling the uncertainty in the sources by associating a set of databases with 
them; this makes it difficult to give precise meaning to his ideas. 

In a recent paper [AD98], Abiteboul and Duschka study the special case in
which all sources are sound, or all sources are sound and complete, but do not 
formulate the more general problem. On the other hand, they consider more 
general view definitions and queries than conjunctive ones. 

The idea of sources as views on the global schema goes back to the GMAP
project [TSI94]. It was subsequently popularized by the Information Manifold
project [LRO96, LMSS95], which introduced the query algorithm that we characterize
in this paper. In a related paper, Levy [Lev96] studies sources that are
complete over some subset of their domain and shows how to use them to obtain
exact answers.

Finally, we mention two special cases: Suppose the global database consists 
of one relation, and all sources are defined by projections. Then if all sources 
are open the certain answer corresponds to the weak instance window function 
[MRW86] with no dependencies, and if all sources are clopen, then the only
possible global database is the notorious universal relation EMa.

Abstract. In data integration systems, queries posed to a mediator
need to be translated into a sequence of queries to the underlying data 
sources. In a heterogeneous environment, with sources of diverse and 
limited query capabilities, not all the translations are feasible. In this 
paper, we study the problem of finding feasible and efficient query plans 
for mediator systems. We consider conjunctive queries on mediators and 
model the source capabilities through attribute-binding adornments. We 
use a simple cost model that focuses on the major costs in mediation sys- 
tems, those involved with sending queries to sources and getting answers 
back. Under this metric, we develop two algorithms for source query se- 
quencing - one based on a simple greedy strategy and another based 
on a partitioning scheme. The first algorithm produces optimal plans
in some scenarios, and we show a linear bound on its worst case per- 
formance when it misses optimal plans. The second algorithm generates 
optimal plans in more scenarios, while having no bound on the margin 
by which it misses the optimal plans. We also report on the results of 
the experiments that study the performance of the two algorithms. 



1 Introduction 

Integration systems based on a mediation architecture provide users with
seamless access to data from many heterogeneous sources. Examples of such
systems are TSIMMIS, Garlic [5], Information Manifold, and DISCO.
In these systems, mediators provide integrated views over the sources. They
translate user queries on integrated
views into source queries and postprocessing operations on the source query
results. The translation process can be quite challenging when integrating a
large number of heterogeneous sources.

One of the important challenges for integration systems is to deal with the 
diverse capabilities of sources in answering queries. This problem arises
due to the heterogeneity in sources ranging from simple file systems to full- 
fledged relational databases. The problem we address in this paper is how to 
generate efficient mediator query plans that respect the limited and diverse ca- 
pabilities of data sources. In particular, we focus our attention on the kind of 
queries that are the most expensive in mediation systems, large join queries. We 
propose efficient algorithms to find good plans for such queries.

1.1 Cost Model

In many applications, the cost of query processing in mediator systems is dom- 
inated by the cost of interacting with the sources. Hence, we focus on the costs 
associated with sending queries to sources. Our results are first stated using a 
very simple cost model where we count the total number of source queries in a 
plan as its cost. In spite of the simplicity of the cost model, the optimization 
problem we are dealing with remains NP-hard. Later in the paper, we show how 
to extend our main results to a more complex cost model that charges a fixed 
cost per query plus a variable cost that is proportional to the amount of data 
transferred. 



1.2 Capabilities-Based Plan Generation 

We consider mediator systems, where users pose conjunctive queries over inte- 
grated views provided by the mediator. These queries are translated into con- 
junctive queries over the source views to arrive at logical query plans. The logical 
plans deal only with the content descriptions of the sources. That is, they tell the 
mediator which sources provide the relevant data and what postprocessing oper- 
ations need to be performed on this data. The logical plans are later translated 
into physical plans that specify details such as the order in which the sources are 
contacted and the exact queries to be sent. Our goal in this paper is to develop 
algorithms that will translate a mediator logical plan into an efficient, feasible 
(does not exceed the source capabilities) physical plan. We illustrate the process 
of translating a logical plan into a physical plan by an example. 

Example 1. Consider three sources that provide information about movies, and 
a mediator that provides an integrated view: 



Source    Contents            Must Bind
S_1       R(studio, title)    either studio or title
S_2       S(title, year)      title
S_3       T(title, stars)     title

Mediator View:

Movie(studio, title, year, stars) :-
    R(studio, title), S(title, year), T(title, stars)

The “Must Bind” column indicates what attributes must be specified at
a source. For instance, queries sent to $S_1$ must either provide the title or the
studio. Suppose the user asks for the titles of all movies produced by Paramount
in 1955 in which Gregory Peck starred. That is,

ans(title) :- Movie(‘Paramount’, title, ‘1955’, ‘Gregory Peck’)

The mediator would translate this query to the logical plan:

ans(title) :- R(‘Paramount’, title), S(title, ‘1955’), T(title, ‘Gregory Peck’)

The logical plan states the information the mediator needs to obtain from the
three sources and how it needs to postprocess this information. In this example, 
the mediator needs to join the results of the three source queries on the title 
attribute. There are many physical plans that correspond to this logical plan 
(based on various join orders and join methods). Some of these plans are feasible 
while others are not. Here are two physical plans for this logical plan: 

— Plan $P_1$: Send query R(‘Paramount’, title) to $S_1$, send query S(title,
‘1955’) to $S_2$, and send query T(title, ‘Gregory Peck’) to $S_3$. Join the
results of the three source queries on the title attribute and return the
title values to the user.

— Plan $P_2$: Get the titles of movies produced by Paramount from source $S_1$.
For each returned title t, send a query to $S_2$ to get its year and check if it
is ‘1955.’ If so, send a query to $S_3$ to get the stars of movie t. If the set of
stars contains ‘Gregory Peck,’ return t to the user.

In the above plans, the first one is not feasible because the queries to sources
$S_2$ and $S_3$ do not provide a binding for title. The second one is feasible. There
are actually other feasible plans (for instance, we can reverse the $S_2$ and $S_3$
queries of $P_2$). If $P_2$ is the cheapest feasible plan, the mediator may execute it.

As illustrated by the example, we need to solve the following problem: Given 
a logical plan and the description of the source capabilities, find feasible physical 
plans for the logical plan. The central problem is to determine the evaluation or- 
der for logical plan subgoals, so that attributes are appropriately bound. Among 
all the feasible physical plans, pick the most efficient one. 
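The feasibility side of this problem can be sketched with a brute-force enumeration of subgoal orders. This is an illustration only, not one of the algorithms developed in the paper. It assumes each subgoal is a tuple of variable names, each access template is a tuple of argument positions that must be bound, and that answering a subgoal binds all of its variables. All names are ours.

```python
from itertools import permutations


def is_feasible(order, subgoals, templates, initially_bound):
    """Check that each subgoal, in sequence, satisfies one of its access templates."""
    bound = set(initially_bound)
    for rel in order:
        args = subgoals[rel]
        if not any(all(args[i] in bound for i in need) for need in templates[rel]):
            return False
        bound.update(args)  # answering the subgoal binds all of its variables
    return True


def feasible_orders(subgoals, templates, initially_bound):
    """Enumerate all feasible orders (exponential; for small examples only)."""
    return [order for order in permutations(subgoals)
            if is_feasible(order, subgoals, templates, initially_bound)]


# Example 1: variables bound by constants in the logical plan.
subgoals = {"R": ("studio", "title"), "S": ("title", "year"), "T": ("title", "stars")}
templates = {"R": [(0,), (1,)], "S": [(0,)], "T": [(0,)]}
bound = {"studio", "year", "stars"}
```

For Example 1 only the orders that start with R are feasible, since S and T both need a binding for title that only R can supply.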



1.3 Related Work 



The problem of ordering subgoals to find the best feasible sequence can be viewed 
as the well known join-order problem. More precisely, we can assign infinite cost 
to infeasible sequences and then find the best join order. 

The join-order problem has been extensively studied in the literature, and 
many solutions have been proposed. Some solutions perform a rather exhaustive 
enumeration of plans, and hence do not scale well.
In particular, we are interested in Internet scenarios with many sources and 
subgoals, so these schemes are too expensive. Some other solutions reduce the 
search space through techniques like simulated annealing, random probes, or 
other heuristics. While these approaches may generate
efficient plans in some cases, they do not have any performance guarantees in 
terms of the quality of plans generated (i.e., the plans generated by them can be 
arbitrarily far from the optimal one) . Many of these techniques may even fail to 
generate a feasible plan, while the user query does have a feasible plan. 

The remaining solutions use specific cost models and clever techniques
that exploit them to produce optimal join orders efficiently. While these
solutions are very good for the join-order problem where those cost models are
appropriate, they are hard to adopt in our context because of two difficulties.
The first is that it is not clear how to model the feasibility of mediator query 
plans in their frameworks. A direct application of their algorithms to the prob- 
lem we are studying may end up generating infeasible plans, when a feasible plan 
exists. The second difficulty is that when we use cost models that emphasize the 
main costs in mediator systems, the optimality guarantees of their algorithms 
may not hold. 



1.4 Our Solution 

In this paper, we develop two algorithms that find good feasible plans. The first 
algorithm runs in O(n^) time, where n is the number of subgoals in the logical 
plan. We provide a linear bound on the margin by which this algorithm can miss 
the optimal plan. Our second algorithm can guarantee optimal plans in more 
scenarios than the first, although there is no bounded optimality for its plans. 
Both our algorithms are guaranteed to find a feasible plan, if the user query has 
a feasible plan. Furthermore, we show through experiments that our algorithms 
have excellent running time profiles in a variety of scenarios, and very often find 
optimal or close-to-optimal plans. This combination of efficient, scalable algo- 
rithms that generate provably good plans is not achieved by previously known 
approaches. 

2 Preliminaries 

In this section, we introduce the notation we use throughout the paper. We also 
discuss the cost model used in our optimization algorithms. 



2.1 Source Relations and Logical Plans 

Let $S_1, \ldots, S_m$ be m sources in an integration system. To simplify the presentation,
we assume that sources provide their data in the form of relations. If
sources have other data models, one could use wrappers to create the simple
relational view of data. Each source is assumed to provide a single relation. If
a source provides multiple relations, we can model it in our framework as a set
of logical sources, all having the same physical source. Example 1 showed three
sources $S_1$, $S_2$ and $S_3$ providing three relations R, S and T respectively.

A query to a source specifies atomic values to a subset of the source relation
attributes and obtains the corresponding set of tuples. A source supports a
set of access templates on its relation that specify binding adornment requirements
for source queries. In Example 1, source $S_2$ had one access template:
$S^{bf}$(title, year), while source $S_1$ had two access templates: $R^{bf}$(studio, title)
and $R^{fb}$(studio, title).

We consider source capabilities described as bf adornment patterns that distinguish
bound (b) and free (f) argument positions. The techniques developed in this
paper can also be employed to solve the problem of mediator query planning when
other source capability description languages are used.

User queries to the mediator are conjunctive queries on the integrated views 
provided by the mediator. Each integrated view is defined as a set of conjunctive 
queries over the source relations. The user query is translated into a logical plan, 
which is a set of conjunctive queries on the source relations. The answer to the 
user query is the union of the results of this set of conjunctive queries. Example 1
showed a user query that was a conjunctive query over the Movie view, and it 
was translated into a conjunctive query over the source relations. 

In order to find the best feasible plan for a user query, we assume that
the mediator processes the logical plan one conjunctive query at a time.
Thus, we reduce the problem of finding the best feasible plan for the
user query to the problem of finding the best feasible plan for a conjunctive query
in the logical plan. From now on, we assume without loss of generality
that a logical plan has a single conjunctive query over the source relations.

Let the logical plan be $H$ :- $C_1, C_2, \ldots, C_n$. We call each $C_i$ a subgoal. Each
subgoal specifies a query on one of the source relations by binding a subset of
the attributes of the source relation. We refer to the attributes of subgoals in
the logical plan as variables. In Example 1, the logical plan had three subgoals
with four variables, three of which were bound.

2.2 Binding Relations and Source Queries 

Given a sequence of n subgoals $C_1, C_2, \ldots, C_n$, we define a corresponding sequence
of n + 1 binding relations $I_0, I_1, \ldots, I_n$. $I_0$ has as its schema the set of
variables bound in the logical plan, and it has a single tuple, denoting the bindings
specified in the logical plan. The schema of $I_1$ is the union of the schema of
$I_0$ and the schema of the source relation of $C_1$. Its instance is the join of $I_0$ and
the source relation of $C_1$. Similarly, we define $I_2$ in terms of $I_1$ and the source
relation of $C_2$, and so on. The answer to the conjunctive query is defined by a
projection operation on $I_n$.

In order to compute a binding relation $I_j$, we need to join $I_{j-1}$ with $C_j$.
There are two ways to perform this operation:

1. Use $I_0$ to send a query to the source of $C_j$ (by binding a subset of its
attributes); perform the join of the result of this source query with $I_{j-1}$ at
the mediator to obtain $I_j$.

2. For $j \geq 2$, use $I_{j-1}$ to send a set of queries to the source relation of $C_j$ (by
binding a subset of its attributes); union the results of these source queries;
perform the join of this union relation with $I_{j-1}$ to obtain $I_j$.

We call the first kind of source query a block query and the second kind a
parameterized query.* Obviously, answering C_j through the first method takes
a single source query, while answering it by the second method can take many
source queries. The main reason why we need to consider parameterized queries
is that it may not be possible to answer some of the subgoals in the logical
plan through block queries. This may be because the access templates for the
corresponding source relations require bindings of variables that are not available
in the logical plan. In order to answer these subgoals, we must use parameterized
queries by executing other subgoals and collecting bindings for the required
parameters of C_j.

* A parameterized query is different from a semijoin, where one can send multiple
bindings for an attribute in a single query.
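The two ways of computing a binding relation can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: tuples are modeled as dictionaries keyed by variable name, the relations R and S and their required bindings are hypothetical, and a parameterized query is assumed to bind exactly the required attributes.

```python
def join(bindings, rows):
    """Natural join of two lists of tuples, each tuple a dict keyed by variable."""
    return [{**b, **r} for b in bindings for r in rows
            if all(b[v] == r[v] for v in b.keys() & r.keys())]

def source_query(relation, bound):
    """One source query: the tuples of `relation` matching the bound attributes."""
    return [r for r in relation if all(r[a] == v for a, v in bound.items())]

def block_query(i0, prev, relation, needs):
    """Method 1: a single source query built from the one tuple of I_0."""
    (t0,) = i0
    answer = source_query(relation, {a: t0[a] for a in needs})
    return join(prev, answer), 1

def parameterized_query(prev, relation, needs):
    """Method 2: one source query per distinct binding taken from I_{j-1}."""
    combos = {tuple(t[a] for a in needs) for t in prev}
    union = []
    for combo in combos:
        union += source_query(relation, dict(zip(needs, combo)))
    return join(prev, union), len(combos)

# Hypothetical instance: R(A, B) needs A bound, S(B, C) needs B bound.
R = [{"A": 1, "B": 1}, {"A": 1, "B": 2}, {"A": 2, "B": 3}]
S = [{"B": 1, "C": 10}, {"B": 2, "C": 20}, {"B": 3, "C": 30}]
I0 = [{"A": 1}]                                  # the logical plan binds A = 1

I1, q1 = block_query(I0, I0, R, ("A",))          # one block query
I2, q2 = parameterized_query(I1, S, ("B",))      # one query per B value in I1
print(q1, q2, I2)
```

In this sketch the block query for R costs one source query, while S needs two parameterized queries because I_1 contains two distinct values of B.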

2.3 The Plan Space 

The space of all possible plans for a given user query is defined first by considering 
all sequences of subgoals in its logical plan. In a sequence, we must then decide 
on the choice of queries for each subgoal (among the set of block queries and 
parameterized queries available for the subgoal). We call a plan in this space 
feasible if all the queries in it are answerable by the sources. Note that the 
number of feasible physical plans, as well as the number of all plans, for a given 
logical plan can be exponential in the number of subgoals in the logical plan. 
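As a rough illustration of this plan space, the sketch below enumerates all sequences of subgoals and keeps the feasible ones. It works at the schema level only, and the subgoal names, variable sets and required bindings are assumptions made for the example.

```python
from itertools import permutations

def feasible_sequences(subgoals, bound_in_plan):
    """Yield every ordering in which each subgoal's required variables are
    bound by the logical plan or by the subgoals that precede it."""
    for order in permutations(subgoals):
        bound = set(bound_in_plan)
        ok = True
        for name in order:
            variables, needs = subgoals[name]
            if not needs <= bound:      # some required binding is missing
                ok = False
                break
            bound |= variables          # this subgoal binds all its variables
        if ok:
            yield order

# Hypothetical subgoals: name -> (variables, variables that must be bound).
subgoals = {
    "R": ({"A", "B", "D"}, {"A"}),
    "S": ({"B", "E"}, {"B"}),
    "T": ({"D", "F"}, {"D"}),
}
plans = list(feasible_sequences(subgoals, {"A"}))
print(plans)
```

Of the 3! = 6 sequences in this example, only the two that start with R are feasible, since S and T need bindings that R supplies.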

Note that the space of plans we consider is similar to the space of left-deep- 
tree executions of a join query. As stated in the following theorem, we do not 
miss feasible plans by not considering bushy-tree executions. 

Theorem 1. We do not miss feasible plans because of considering only left- 
deep-tree executions of the subgoals. 

Proof. For any feasible execution of the logical plan based on a bushy tree of
subgoals, we can construct another feasible execution based on a left-deep tree
of subgoals (with the same leaf order). This is similar to the bound-is-easier
assumption of [?]. See the full version of our paper [?] for a detailed proof.

2.4 The Formal Cost Model 

Our cost model is defined as follows: 

1. The cost of a subgoal in the feasible plan is the number of source queries 
needed to answer this subgoal. 

2. The cost of a feasible plan is the sum of the costs of all the subgoals in the 
plan. 

We develop the main results of the paper in the simple cost model presented
above. Later, in Section 5, we will show how to extend these results to more
complex cost models. We also consider more practical cost models in Section 6,
where we analyze the performance of our algorithms. Here, we note that even
in the simple cost model that counts only the number of source queries, the
problem of finding the optimal feasible plan is quite hard.

Theorem 2. The problem of finding the feasible plan with the minimum number 
of source queries is NP-hard. 
Proof. We reduce the Vertex Cover problem ([?]) to our problem. Since the
Vertex Cover problem is NP-complete, our problem is NP-hard.

Given a graph G with n vertices V_1, ..., V_n, we construct a database and a
logical plan as follows. Corresponding to each vertex V_i we define a relation R_i.
For all 1 ≤ i < j ≤ n, if V_i and V_j are connected by an edge in G, R_i and R_j
include the attribute A_ij. In addition, we define a special attribute X and two
special relations R_0 and R_{n+1}. In all, we have a total of m + 1 attributes, where
m is the number of edges in G. The special attribute X is in the schema of all the
relations. The special relation R_{n+1} also has all the attributes A_ij. That is, R_0
has only one attribute and R_{n+1} has m + 1 attributes. Each relation has a tuple
with a value of 1 for each of its attributes. In addition, all relations except R_{n+1}
include a second tuple with a value of 2 for all their attributes. Each relation
has a single access template: R_0 has no binding requirements, R_1 through R_n
require the attribute X to be bound, and R_{n+1} requires all of the attributes to
be bound. Finally, the logical plan consists of all the n + 2 relations, with no
variables bound.

It is obvious that the above construction of the database and the logical plan
takes time that is polynomial in the size of G. Now, we show that G has a vertex
cover of size k if and only if the logical plan has a feasible physical plan that
requires (n + k + 3) source queries.

Suppose G has a vertex cover of size k. Without loss of generality, let it be
V_1, ..., V_k. Consider the physical plan P that first answers the subgoal R_0 with a
block query, then answers R_1, ..., R_k, R_{n+1}, R_{k+1}, ..., R_n using parameterized
queries. P is a feasible plan because R_0 has no binding requirements, R_1, ..., R_k
need X to be bound and X is available from R_0, and R_1, ..., R_k will bind all the
variables (since V_1, ..., V_k is a vertex cover). In P, R_0 is answered by a single
source query, R_1, ..., R_k and R_{n+1} are answered by two source queries each,
and R_{k+1}, ..., R_n are answered by one source query each. This gives a total of
1 + 2k + 2 + (n − k) = (n + k + 3) source queries for this plan. Thus, we see that
if G has a vertex cover of size k, we have a feasible plan with (n + k + 3) source
queries.

Suppose there is a feasible plan P' with f source queries. In P', the first
relation must be R_0, and this subgoal must be answered by a block query
(because the logical plan does not bind any variables). All the other subgoals must
be answered by parameterized queries. Consider the set of subgoals in P' that
are answered before R_{n+1} is answered. Let j be the size of this set of subgoals
(excluding R_0). Since R_{n+1} needs all attributes to be bound, the union of the
schemas of these j subgoals must be the entire attribute set. That is, the vertices
corresponding to these j subgoals form a vertex cover in G. In P', each of these j
subgoals takes two source queries, along with R_{n+1}, while the rest of the (n − j)
subgoals in R_1, ..., R_n take one source query each. That is, f = 1 + 2j + 2 + (n − j).
From this, we see that we can find a vertex cover for G of size (f − n − 3).

Hence, G has a vertex cover of size k if and only if there is a feasible plan
with (n + k + 3) source queries. That is, we have reduced the problem of finding
the minimum vertex cover in a graph to our problem of finding a feasible plan
with minimum source queries.
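The construction can be checked on a small graph with the sketch below. It is only an illustration under stated assumptions: relations are keyed by their index 0..n+1, attribute names such as A1_2 are invented for the example, and a parameterized query is assumed to bind exactly the required attributes, so its cost is the number of distinct bindings in the current binding relation.

```python
from itertools import permutations

def join(bindings, rows):
    """Natural join of two lists of dict-tuples."""
    return [{**b, **r} for b in bindings for r in rows
            if all(b[v] == r[v] for v in b.keys() & r.keys())]

def build_instance(n, edges):
    """Relations R_0 .. R_{n+1} for a graph on vertices 1..n."""
    attr = {e: "A%d_%d" % e for e in edges}           # one attribute per edge
    schemas = {0: ["X"], n + 1: ["X"] + list(attr.values())}
    for v in range(1, n + 1):
        schemas[v] = ["X"] + [attr[e] for e in edges if v in e]
    rels = {}
    for i, schema in schemas.items():
        values = [1] if i == n + 1 else [1, 2]        # R_{n+1} has only the all-1 tuple
        needs = [] if i == 0 else (schema if i == n + 1 else ["X"])
        rels[i] = {"vars": schema, "needs": needs,
                   "rows": [dict.fromkeys(schema, v) for v in values]}
    return rels

def plan_cost(order, rels):
    """Number of source queries for `order`, or None if it is infeasible."""
    bound, current, total = set(), [{}], 0            # the logical plan binds nothing
    for i in order:
        needs = rels[i]["needs"]
        if not set(needs) <= bound:
            return None
        # no required bindings: one block query; otherwise one query per binding
        total += 1 if not needs else len({tuple(t[a] for a in needs) for t in current})
        bound |= set(rels[i]["vars"])
        current = join(current, rels[i]["rows"])
    return total

# Path graph 1 - 2 - 3: the minimum vertex cover is {2}, so k = 1.
n, edges = 3, [(1, 2), (2, 3)]
rels = build_instance(n, edges)
costs = [c for p in permutations(rels) if (c := plan_cost(p, rels)) is not None]
best = min(costs)
print(best, n + 1 + 3)
```

For this graph the cheapest feasible sequence costs 7 = n + k + 3 with k = 1, while a plan that uses the larger cover {1, 3} before R_{n+1} costs 8.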



Optimizing Large Join Queries in Mediation Systems 355 



In our cost model, it turns out that it is safe to restrict the space of plans to 
those based on left-deep-tree executions of the set of subgoals. 

Theorem 3. We do not miss the optimal plan by not considering the executions
of the logical plan based on bushy trees of subgoals.

Proof. See the full version of our paper [?] for the proof.

3 The CHAIN Algorithm 

In this section, we present the CHAIN algorithm for finding the best feasible 
query plan. This algorithm is based on a greedy strategy of building a single 
sequence of subgoals that is feasible and efficient. 



The CHAIN Algorithm

Input:  Logical plan - subgoals and bound variables.
Output: Feasible physical plan.

• Initialize:
    S ← {C_1, C_2, ..., C_n}    /* set of subgoals in the logical plan */
    B ← set of bound variables in the logical plan
    L ← φ                       /* start with an empty sequence */

• Construct the sequence of subgoals:
    while (S ≠ φ) do
        M ← infinity;
        N ← null;
        for each subgoal C_i in S do    /* find the cheapest subgoal */
            if (C_i is answerable with B) then
                c ← Cost_L(C_i);    /* get the cost of this subgoal in sequence L */
                if (c < M) then
                    M ← c;
                    N ← C_i;
        /* If no next answerable subgoal, declare no feasible plan */
        if (N = null)
            return(φ);
        /* Add next subgoal to plan */
        L ← L + N;
        S ← S − {N};
        B ← B ∪ {variables of N};

• Return the feasible plan:
    return(Plan(L));    /* construct plan from sequence L */

Fig. 1. Algorithm CHAIN
As shown in Figure 1, CHAIN starts by finding all subgoals that are answerable
with the initial bindings in the logical plan. It then picks the answerable
subgoal with the least cost and computes the additional variables that are now 
bound due to the chosen subgoal. It repeats the process of finding answerable 
subgoals, picking the cheapest among them and updating the set of bound vari- 
ables, until no more subgoals are left or some subgoals are left but none of them 
is answerable. If there are subgoals left over, CHAIN declares that there is no 
feasible plan; otherwise it outputs the plan it has constructed. 
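The greedy loop of Figure 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the cost function simulates the mediator on in-memory relations, counts one query for a block query and one query per distinct binding for parameterized queries, and the subgoals R, S and T with their required bindings are hypothetical.

```python
def join(bindings, rows):
    """Natural join of two lists of dict-tuples."""
    return [{**b, **r} for b in bindings for r in rows
            if all(b[v] == r[v] for v in b.keys() & r.keys())]

def cost(current, plan_vars, sg):
    """Source queries needed for subgoal sg after the partial sequence."""
    if set(sg["needs"]) <= plan_vars:
        return 1      # block query using only the logical plan's bindings
    # parameterized queries: one per distinct binding of the required attributes
    return len({tuple(t[a] for a in sg["needs"]) for t in current})

def chain(subgoals, plan_bindings):
    """Greedy CHAIN: returns (sequence, total cost), or None if no feasible plan."""
    remaining = dict(subgoals)            # S
    bound = set(plan_bindings)            # B
    current = [dict(plan_bindings)]       # binding relation, initially I_0
    sequence, total = [], 0               # L and its accumulated cost
    while remaining:
        best, best_cost = None, float("inf")
        for name, sg in remaining.items():
            if set(sg["needs"]) <= bound:                 # answerable with B
                c = cost(current, set(plan_bindings), sg)
                if c < best_cost:
                    best, best_cost = name, c
        if best is None:
            return None                   # subgoals left, none answerable
        sg = remaining.pop(best)
        sequence.append(best)
        total += best_cost
        bound |= set(sg["vars"])
        current = join(current, sg["rows"])
    return sequence, total

# Hypothetical subgoals: R(A, B) needs A; S(B, C) and T(B, D) need B.
subgoals = {
    "R": {"vars": ("A", "B"), "needs": ("A",),
          "rows": [{"A": 1, "B": 1}, {"A": 1, "B": 2}]},
    "S": {"vars": ("B", "C"), "needs": ("B",), "rows": [{"B": 1, "C": 5}]},
    "T": {"vars": ("B", "D"), "needs": ("B",),
          "rows": [{"B": 1, "D": 7}, {"B": 2, "D": 8}]},
}
result = chain(subgoals, {"A": 1})
print(result)
```

Ties are broken here in favor of the subgoal listed first, which is one possible choice; the figure does not prescribe a tie-breaking rule.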

3.1 Complexity and Optimality of CHAIN 

Here we show that the CHAIN algorithm is very efficient, that it is guaranteed to
find feasible plans when they exist, and that there is a linear bound on the optimality
of the plans it generates. Due to space limitations, we have not provided proofs
for all the lemmas and theorems. They are available in the full version of the
paper [?].

Lemma 1. CHAIN runs in O(n^2) time, where n is the number of subgoals.*

Lemma 2. CHAIN will generate a feasible plan, if the logical plan has feasible 
physical plans. 

Lemma 3. If the result of the user query is nonempty, and the number of sub- 
goals in the logical plan is less than 3, CHAIN is guaranteed to find the optimal 
plan. 

Lemma 4. CHAIN can miss the optimal plan if the logical plan has more than 
2 subgoals. 

Proof. We construct a logical plan with 3 subgoals and a database instance that 
result in CHAIN generating a suboptimal plan. 



    R(A, B, D)      S(B, E)      T(D, F)
    ----------      -------      -------
    (1, 1, 1)       (1, 1)       (4, 1)
    (1, 2, 2)       (2, 1)       (5, 1)
    (1, 3, 3)       (3, 1)       (6, 1)
    (1, 1, 4)       (4, 1)       (7, 1)

Table 1. Database Instance for Lemma 4

Consider a logical plan H :- R(1, B, D), S(B, E), T(D, F) and the database
instance shown in Table 1. For this logical plan and database, CHAIN will
generate the plan: R → S → T, with a total cost of 1 + 3 + 4 = 8. We observe that
a cheaper feasible plan is: R → T → S, with a total cost of 1 + 4 + 1 = 6. Thus,
CHAIN misses the optimal plan in this case.
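The two costs can be reproduced with a small simulation over the instance of Table 1. The sketch assumes that R must be queried with A bound, S with B bound and T with D bound (the access templates are not legible in the table), and that each parameterized query binds exactly the required attribute.

```python
from itertools import permutations

def join(bindings, rows):
    """Natural join of two lists of dict-tuples."""
    return [{**b, **r} for b in bindings for r in rows
            if all(b[v] == r[v] for v in b.keys() & r.keys())]

def rel(variables, needs, tuples):
    """Build a subgoal description from a schema and a list of value tuples."""
    return {"vars": variables, "needs": needs,
            "rows": [dict(zip(variables, t)) for t in tuples]}

def plan_cost(order, subgoals, plan_bindings):
    """Source queries used by the sequence `order`, or None if infeasible."""
    bound, current, total = set(plan_bindings), [dict(plan_bindings)], 0
    for name in order:
        sg = subgoals[name]
        if not set(sg["needs"]) <= bound:
            return None
        if set(sg["needs"]) <= set(plan_bindings):
            total += 1                                   # block query
        else:                                            # parameterized queries
            total += len({tuple(t[a] for a in sg["needs"]) for t in current})
        bound |= set(sg["vars"])
        current = join(current, sg["rows"])
    return total

# Table 1, with assumed access templates (second argument of rel).
table1 = {
    "R": rel(("A", "B", "D"), ("A",), [(1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 1, 4)]),
    "S": rel(("B", "E"), ("B",), [(1, 1), (2, 1), (3, 1), (4, 1)]),
    "T": rel(("D", "F"), ("D",), [(4, 1), (5, 1), (6, 1), (7, 1)]),
}
bindings = {"A": 1}                                      # H :- R(1, B, D), ...
cost_rst = plan_cost(("R", "S", "T"), table1, bindings)
cost_rts = plan_cost(("R", "T", "S"), table1, bindings)
feasible = {p: c for p in permutations(table1)
            if (c := plan_cost(p, table1, bindings)) is not None}
print(cost_rst, cost_rts, min(feasible.values()))
```

Under these assumptions the simulation gives 8 for R → S → T and 6 for R → T → S, and no other ordering is feasible.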

* We are assuming here that finding the cost of a subgoal following a partial sequence
takes O(1) time.
It is not difficult to find situations in which the CHAIN algorithm misses the
optimal plan. However, surprisingly, there is a linear upper bound on how far its
plan can be from the optimal. In fact, we prove a stronger result in Lemma 5.

Lemma 5. Suppose P^C is the plan generated by CHAIN for a logical plan with
n subgoals; P^O is the optimal plan, and E_max is the cost of the most expensive
subgoal in P^O. Then,

\[ \mathrm{Cost}(P^C) \le n \times E_{max} \]




Fig. 2. Proof for Lemma 5



Proof. Without loss of generality, suppose the sequence of subgoals in P^C is
C_1, C_2, ..., C_n. As shown in Figure 2, let the first subgoal in P^O be C_{m_1}. Let G_1
be the prefix of P^C such that G_1 = C_1 ... C_{m_1}. When CHAIN chooses C_1, the
subgoal C_{m_1} is answerable. This implies that the cost of C_1 in P^C is less
than or equal to the cost of C_{m_1} in P^O. After processing C_1 in P^C, the subgoal
C_{m_1} remains answerable and its cost of processing cannot increase. So, if CHAIN
has chosen another subgoal C_2 instead of C_{m_1}, once again we can conclude that
the cost of C_2 in P^C is not greater than the cost of C_{m_1} in P^O. Finally, at the
end of G_1, when C_{m_1} is processed in P^C, we note that the cost of C_{m_1} in P^C is
no more than the cost of C_{m_1} in P^O. Thus, the cost of each subgoal of G_1 is less
than or equal to the cost of C_{m_1} in P^O.

We call C_{m_1} the first pivot in P^O. We define the next pivot C_{m_2} in P^O as
follows. C_{m_2} is the first subgoal after C_{m_1} in P^O such that C_{m_2} is not in G_1.
Now, we can define the next subsequence G_2 of P^C such that the last subgoal of
G_2 is C_{m_2}. The cost of each subgoal in G_2 is less than or equal to the cost of
C_{m_2} in P^O.

We continue finding the rest of the pivots C_{m_3}, ..., C_{m_k} in P^O and the
corresponding subsequences G_3, ..., G_k in P^C. Based on the above argument, we
have

\[ \forall C_i \in G_j : (\text{cost of } C_i \text{ in } P^C) \le (\text{cost of } C_{m_j} \text{ in } P^O) \]

From this, it follows that

\[ \mathrm{Cost}(P^C) = \sum_{j=1}^{k} \sum_{C_i \in G_j} (\text{cost of } C_i \text{ in } P^C) \le \sum_{j=1}^{k} |G_j| \times (\text{cost of } C_{m_j} \text{ in } P^O) \le n \times E_{max} \]

Theorem 4. CHAIN is n-competitive. That is, the plan generated by CHAIN
can be at most n times as expensive as the optimal plan, where n is the number
of subgoals.

Proof. Follows from Lemma 5.

The cost of the plan generated by CHAIN can be arbitrarily close to the
cost of the optimal plan multiplied by the number of subgoals; i.e., Theorem 4
cannot be improved. However, in many situations CHAIN yields optimal plans
or plans whose cost is very close to that of the optimal plan, as demonstrated in
Section 6.

4 The PARTITION Algorithm 

In this section, we present another algorithm called PARTITION for finding 
efficient feasible plans. PARTITION takes a very different approach to solve the 
plan generation problem. It is guaranteed to generate optimal plans in more 
scenarios than CHAIN but has a worse running time. 

4.1 PARTITION 

The formal description of PARTITION is available in the full version of the
paper [?]. Here, we present the essential aspects of the algorithm.

The PARTITION algorithm has two phases. In the first phase, it organizes 
the subgoals into clusters based on the capabilities of the sources. The property 
satisfied by the clusters generated by the first phase of PARTITION is as follows. 
All the subgoals in the first cluster are answerable by block queries; all the 
subgoals in each subsequent cluster are answerable by parameterized queries 
that use attribute bindings from the subgoals of the earlier clusters. To obtain 
the clusters, PARTITION keeps track of the set of bound variables V. Initially, V 
is the set of variables bound in the logical plan. The first phase of PARTITION is 
divided into many rounds (one per cluster). In each round, the set of answerable 
subgoals based on the bound variable set V is collected into a new cluster. These 
subgoals are removed from the set of subgoals that are yet to be picked, and the 
variables bound by these subgoals are added to V. If in a round of the first phase 
there are subgoals yet to be picked and none of them is answerable, PARTITION 
declares that there is no feasible plan for the user query. 

In the second phase, PARTITION finds the best subplan for each cluster of 
subgoals and combines the subplans to arrive at the best overall plan for the user 
query. The subplan for each cluster is found by enumerating all the sequences of 
subgoals in the cluster and choosing the one with the least cost. 
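The two phases described above can be sketched as follows. The formal algorithm is only given in the full version of the paper, so this is a reading of the prose description under assumptions: costs are obtained by simulating the mediator on in-memory relations, parameterized queries bind exactly the required attributes, and the example reuses the instance of Table 1 with assumed access templates.

```python
from itertools import permutations

def join(bindings, rows):
    """Natural join of two lists of dict-tuples."""
    return [{**b, **r} for b in bindings for r in rows
            if all(b[v] == r[v] for v in b.keys() & r.keys())]

def rel(variables, needs, tuples):
    return {"vars": variables, "needs": needs,
            "rows": [dict(zip(variables, t)) for t in tuples]}

def clusters(subgoals, plan_vars):
    """Phase 1: one cluster per round of subgoals answerable with V."""
    pending, bound, out = dict(subgoals), set(plan_vars), []
    while pending:
        ready = [n for n, sg in pending.items() if set(sg["needs"]) <= bound]
        if not ready:
            return None                      # no feasible plan
        for n in ready:
            bound |= set(pending.pop(n)["vars"])
        out.append(ready)
    return out

def run(order, subgoals, current, plan_vars):
    """Cost of answering `order` from binding relation `current`, and the result."""
    total = 0
    for n in order:
        sg = subgoals[n]
        if set(sg["needs"]) <= plan_vars:
            total += 1                       # block query
        else:
            total += len({tuple(t[a] for a in sg["needs"]) for t in current})
        current = join(current, sg["rows"])
    return total, current

def partition(subgoals, plan_bindings):
    """Phase 2: best sequence inside each cluster, clusters kept in order."""
    plan_vars = set(plan_bindings)
    groups = clusters(subgoals, plan_vars)
    if groups is None:
        return None
    current, plan, total = [dict(plan_bindings)], [], 0
    for group in groups:
        best = min(permutations(group),
                   key=lambda p: run(p, subgoals, current, plan_vars)[0])
        cost, current = run(best, subgoals, current, plan_vars)
        plan.extend(best)
        total += cost
    return plan, total

table1 = {
    "R": rel(("A", "B", "D"), ("A",), [(1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 1, 4)]),
    "S": rel(("B", "E"), ("B",), [(1, 1), (2, 1), (3, 1), (4, 1)]),
    "T": rel(("D", "F"), ("D",), [(4, 1), (5, 1), (6, 1), (7, 1)]),
}
groups = clusters(table1, {"A"})
result = partition(table1, {"A": 1})
print(groups, result)
```

On this instance the sketch forms the clusters {R} and {S, T} and selects R → T → S with 6 source queries.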

4.2 Optimality and Complexity of PARTITION 

Like the CHAIN algorithm, the PARTITION algorithm always finds feasible 
plans when they exist. It is guaranteed to find optimal plans in more scenarios 
than CHAIN. However, when it misses the optimal plans, it can miss them 
by an unbounded margin. It is also much less efficient than CHAIN, and can 
take time that is exponential in the number of subgoals of the logical plan. 
These observations are formally stated by the following lemmas (proofs omitted 
occasionally due to space limitations). 






Lemma 6. If feasible physical plans exist for a given logical plan, PARTITION 
is guaranteed to find a feasible plan. 



Lemma 7. If there are fewer than 3 clusters generated, and the result of the 
query is nonempty, then PARTITION is guaranteed to find the optimal plan. 

Proof. We proceed by a simple case analysis. There are two cases to consider. 

The first case is when there is only one cluster Γ_1. PARTITION finds the best
sequence among all the permutations of the subgoals in Γ_1. Since Γ_1 contains all
the subgoals of the logical plan, PARTITION will find the best possible sequence.

The second case is when there are two clusters Γ_1 and Γ_2. Let P be the
optimal feasible plan. We will show how we can transform P into a plan in the
plan space of PARTITION that is at least as good as P.

Let C_i be a subgoal in Γ_1. There are two possibilities: (a) C_i is answered
in P by using a block query; (b) C_i is answered in P by using parameterized
queries. If C_i is answered by a block query, we make no change to P. Otherwise,
we modify P as follows. As the result of the query is not empty, the cost of
subgoal C_i (using parameterized queries) in P must be at least 1. Since C_i is in
the first cluster, it can be answered by using a block query. So we can modify P
by replacing the parameterized queries for C_i with the block query for C_i. Since
the cost of a block query can be at most 1, this modification cannot increase
the cost of P. For all subgoals in Γ_1, we repeat the above transformation until
we get a plan P', in which all the subgoals in Γ_1 are answered by using block
queries.

We apply a second transformation to P' with respect to the subgoals in Γ_1.
Since all these subgoals are answered by block queries in P', we can move them
to the beginning of P' to arrive at a new plan P''. Moving these subgoals ahead
of the other subgoals will preserve the feasibility of the plan. It is also true that
this transformation cannot increase the cost of the plan. This is because it does
not change the cost of these subgoals, and it cannot increase the cost of the other
subgoals in the sequence. Hence, P'' cannot be more expensive than P'.

After the two-step transformation, we get a plan P'' that is as good as P.
Finally, we note that P'' is in the plan space of PARTITION, and so the plan
generated by PARTITION cannot be worse than P''. Thus, the plan found by
PARTITION must be as good as the optimal plan.



Lemma 8. If the number of subgoals in the logical plan does not exceed 3, and 
the result of the query is not empty, then PARTITION will always find the 
optimal plan. 

Proof. Follows from Lemma 7.

PARTITION cannot generate the optimal plan in many cases. One can con- 
struct logical plans with as few as 4 subgoals that lead the algorithm to generate 
a sub-optimal plan. Also, PARTITION can miss the optimal plan by a margin 
that is unbounded by the query parameters. 
Lemma 9. For any integer m > 0, there exists a logical plan and a database
for which PARTITION generates a plan that is at least m times as expensive as
the optimal plan.

Proof. Refer to [?] for the detailed proof. The essential idea is to construct a
logical plan and a database for any given m that will make PARTITION miss
the optimal plan by a factor greater than m.

Lemma 10. The PARTITION algorithm runs in O(n^2 + (k_1! + k_2! + ... + k_p!)) time,
where n is the number of subgoals in the logical plan, p is the number of clusters
found by PARTITION, and k_i is the number of subgoals in the i-th cluster.*



4.3 Variations of PARTITION 

We have seen that the PARTITION algorithm can miss the optimal plan in 
many scenarios, and in the worst case it has a running time that is exponential 
in the number of subgoals in the logical plan. In a way, it attempts to strike 
a balance between running time and the ability to find optimal plans. A naive 
algorithm that enumerates all sequences of subgoals will always find the optimal 
plan, but it may take much longer than PARTITION. PARTITION tries to cut 
down on the running time, and gives up the ability to find optimal plans to a 
certain extent. Here, we consider two variations of PARTITION that highlight 
this trade-off. 

We call the first variation FILTER. This variation is based on the observation
of Lemma 7. FILTER also has two phases like PARTITION. In its first phase,
it mimics PARTITION to arrive at the clusters Γ_1, Γ_2, ..., Γ_p. At the end of the
first phase, it keeps the first cluster as is, and collapses all the other clusters into
a new second cluster Γ'. That is, it ends up with Γ_1 and Γ'. The second phase of
FILTER is identical to that of PARTITION. FILTER is guaranteed to find the
optimal plan (as long as the query result is nonempty), but its running time is
much worse than that of PARTITION. Yet, it is more efficient than the naive
algorithm that enumerates all plans.

Lemma 11. If the user query has a nonempty result, FILTER will generate the
optimal plan.

Proof. We can prove this lemma in the same way we proved Lemma 7.

Lemma 12. The running time of FILTER is O(n^2 + (k_1! + (n − k_1)!)).†
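The collapsing step of FILTER, and its effect on the number of sequences examined in the second phase, can be illustrated with the sketch below. The cluster contents are hypothetical, and the count is an upper bound: inside the merged cluster some orderings may be infeasible and would have to be discarded.

```python
from math import factorial

def filter_clusters(groups):
    """Keep the first cluster and merge every later cluster into one."""
    if len(groups) < 2:
        return [list(g) for g in groups]
    return [list(groups[0]), [name for g in groups[1:] for name in g]]

def sequences_enumerated(groups):
    """Sequences examined by an exhaustive search inside each cluster."""
    return sum(factorial(len(g)) for g in groups)

groups = [["R"], ["S", "T"], ["U"]]          # hypothetical PARTITION clusters
merged = filter_clusters(groups)
counts = (sequences_enumerated(groups),      # PARTITION: k_1! + ... + k_p!
          sequences_enumerated(merged),      # FILTER: k_1! + (n - k_1)!
          factorial(4))                      # naive enumeration: n!
print(merged, counts)
```

With four subgoals split into clusters of sizes 1, 2 and 1, the per-cluster search examines 4 sequences, FILTER examines 7, and the naive enumeration examines 24.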

The second variation of PARTITION is called SCAN. This variation focuses
on efficient plan generation. The main idea here is to simplify the second phase
of PARTITION so that it can run efficiently. The penalty is that SCAN may not
generate optimal plans in many cases where PARTITION does.

* If the query result is nonempty, PARTITION can consider just one sequence (instead
of k_1!) for the first cluster.

† If the query result is nonempty, FILTER can consider just one sequence (instead of
k_1!) for the first cluster.

SCAN also has two phases of processing. The first phase is identical to that
of PARTITION. In the second phase, SCAN picks an arbitrary order for each
cluster without searching over all the possible orders. This leads to a second
phase that runs in O(n) time. Note that since it does not search over the space
of subsequences for each cluster, SCAN tends to generate plans that are inferior
to those of PARTITION.

Lemma 13. SCAN runs in O(n^2) time, where n is the number of subgoals in
the logical plan.
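A schema-level sketch of SCAN is given below. It is an illustration under assumptions: each subgoal is described by a hypothetical pair (variables, required bindings), and the "arbitrary order" inside a cluster is simply the order in which the subgoals were listed.

```python
def scan(subgoals, plan_vars):
    """Cluster the subgoals round by round and emit each cluster as found."""
    pending, bound, plan = dict(subgoals), set(plan_vars), []
    while pending:
        ready = [n for n, (_, needs) in pending.items() if needs <= bound]
        if not ready:
            return None                 # subgoals left, none answerable
        for n in ready:
            bound |= pending.pop(n)[0]  # add the variables this subgoal binds
        plan.extend(ready)              # arbitrary order: here, listing order
    return plan

# Hypothetical subgoals: name -> (variables, variables that must be bound).
subgoals = {
    "R": ({"A", "B", "D"}, {"A"}),
    "S": ({"B", "E"}, {"B"}),
    "T": ({"D", "F"}, {"D"}),
    "U": ({"F", "G"}, {"F"}),
}
plan = scan(subgoals, {"A"})
print(plan)
```

Each round scans the pending subgoals once, which is where the quadratic bound of Lemma 13 comes from; no permutations are examined.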

5 Other Cost Models 

So far, we discussed algorithms that minimize the number of source queries. 
Now, we consider more complex cost models where different source queries can 
have different costs. 

First, we consider a simple extension (say M_1) where the cost of a query
to source S_i is e_i. That is, queries to different sources cost different amounts.
Note that in M_1, we still do not charge for the amount of data transferred.
Nevertheless, it is strictly more general than the model we discussed in Section 2.4.
All of our results presented so far hold in this new model.

Theorem 5. In the cost model M_1, Theorem 4 holds. That is, the CHAIN
algorithm is n-competitive, where n is the number of subgoals.

Theorem 6. In the cost model M_1, Lemma 7 holds. That is, the PARTITION
algorithm will find the optimal plan, if there are at most two clusters and the
user query has a nonempty result.

Next, we consider a more complex cost model (say M_2) where the data
transfer costs are factored in. That is, the cost of a query to source S_i is
e_i + f_i × (size of query result). Note that this cost model is strictly more
general than M_1.

Theorem 7. In the cost model M_2, Theorem 4 holds. That is, the CHAIN
algorithm is n-competitive, where n is the number of subgoals.

Theorem 8. In the cost model M_2, Lemma 7 does not hold. That is, the
PARTITION algorithm cannot guarantee the optimal plan, even when there are at
most two clusters.

We observe that the n-competitiveness of CHAIN holds in any cost model
with the following property: the cost of a subgoal in a plan does not increase by
postponing its processing to a later time in the plan. We also note that the
PARTITION algorithm with two clusters will always find the optimal plan (assuming
the query has a nonempty result) if block queries cannot cost more than the
corresponding parameterized queries. This property holds, for instance, in model
M_1 and not in model M_2. When one considers cost models other than those
discussed here, these properties may hold in them and consequently CHAIN and
PARTITION may yield very good results.
6 Performance Analysis

In this section, we address the questions: How often do PARTITION and CHAIN
find the optimal plan? When they miss the optimal plan, what is the expected
margin by which they miss? We answer these questions by experiments in a
simulated environment. We used both the simple cost model of Section 2.4 as
well as the more complex cost model M_2 of Section 5 in our performance analysis.
The results did not deviate much from one cost model to the other. The details
of the experiments are in [?]. Here, we briefly mention the important results
based on the simpler cost model of Section 2.4.

Figure 3 plots the fraction of the times the algorithms missed the optimal
plans vs. the number of query subgoals. Over a set of 1000 queries with the number of
subgoals ranging from 1 to 10, PARTITION generated the optimal plan in more
than 95% of the cases, and CHAIN generated the optimal plan more than 75%
of the time. This result is surprising because we know that PARTITION can
miss optimal plans for queries with as few as 4 subgoals and CHAIN can miss
optimal plans for queries with as few as 3 subgoals.

Figure 4 plots the average margin by which generated plans missed the
optimal plan vs. the number of query subgoals. Both CHAIN and PARTITION
found near-optimal plans over the entire range of queries and, on the average,
missed the optimal plan by less than 10%.

In summary, the PARTITION algorithm can have excellent practical
performance, even though it gives very few theoretical guarantees. CHAIN also has
very good performance, well beyond the theoretical guarantees we proved in
Section 3.1. Finally, comparing the two algorithms, we observe that PARTITION
consistently outperforms CHAIN in finding near-optimal plans.

7 Conclusion 

In this paper, we considered the problem of query planning in heterogeneous 
data integration systems based on the mediation approach. We employed a cost 
model that focuses on the main costs in mediation systems. In this cost model, 
we developed two algorithms that guarantee the generation of feasible plans 
(when they exist). We showed that the problem at hand is NP-hard. One of our 
algorithms runs in polynomial time. It generates optimal plans in many cases 
and in other cases it has a linear bound on the worst case margin by which 
it misses the optimal plans. The second algorithm finds optimal plans in more 
scenarios, but has no bound on the margin of missing the optimal plans in the 
bad scenarios. We analyzed the performance of our algorithms using simulation 
experiments and extended our results to more complex cost models. 

Abstract. We give incremental algorithms, which support both edge
insertions and deletions, for the all-pairs shortest-distance problem (APSD)
in weighted undirected graphs. Our algorithms use first-order queries, +
(addition) and < (less-than); they store O(n^2) tuples, where
n is the number of vertices, and have AC^0 data complexity for integer
weights. Since FO(+, <) is supported by almost all current database
systems, our maintenance algorithms are more appropriate for database
applications than non-database query type of maintenance algorithms.
Our algorithms can also be extended to duplicate semantics.



1 Introduction 

Finding shortest-path information in a graph is one of the most commonly en- 
countered problems in the study of transportation and communication networks 
and has applications in communication systems, scheduling, computation of net- 
work flows, and in the context of document formatting etc. Since shortest-path 
information can be used for routing in a communication network, the possibility 
of changes to the network, say due to a link failure or a new link being added to 
the service, makes the incremental shortest-path problem relevant to routing. 

Problem statement: We consider the incremental maintenance of the all-pairs
shortest distance (which is called the distance problem) in undirected graphs
after an edge insertion or an edge deletion. We only consider single edge
insertion and single edge deletion operations as the unit changes to the graph. Any
distance modification on an edge can be accomplished by first deleting
the edge and then adding a new edge with the modified distance between
the same vertices.

Contributions: We start from the incremental maintenance algorithm for the
all-pairs shortest-distance problem with positive distance on each edge (APSD >
0), then extend the maintenance result to the all-pairs shortest-distance
problem with non-negative distance on edges (APSD ≥ 0). A “restricted” reduction
result between the APSD > 0 case and the APSD ≥ 0 case is then given. Our
maintenance algorithms for the distance problem can be applied to maintain the
transitive closure problem in undirected graphs.

In undirected graphs, the restriction of “non-negative distance” makes each
shortest walk between two vertices a shortest path between the same two
vertices. The presence of negative-distance edges implies that shortest walks can
be unbounded.

For the distance problem, our algorithms use first-order queries with addition
“+” and less-than “<” operations (FO(+, <)) to maintain the shortest distance
dynamically in undirected graphs after each edge insertion and edge deletion.
The time complexity of our algorithms is dominated by the “+” and “<”
operations (with an upper bound of NC^1). When the distance of each edge in the
graph is an integer, our algorithms have AC^0 data complexity (the class
of problems that can be solved using polynomially many processors in constant
time). To the best of our knowledge, no AC^0 incremental algorithms (even
non-FO algorithms) supporting both edge deletions and edge insertions for
general undirected graphs were previously known. Our algorithms are the first
of this kind. Since FO(+, <) is supported by almost all current database
systems, our incremental maintenance algorithms are more appropriate for database
applications than non-FO incremental algorithms. For the maintenance of the
transitive closure problem, we keep the “shortest length” (the number of edges
on the shortest paths). Our algorithms use simple data structures. The absence
of recursion in our algorithms makes additional optimizations possible, as a range
of optimization techniques to evaluate relational database queries can be applied
to our algorithms. The optimization of recursive queries is significantly harder
than that of relational queries.

All our algorithms employ one common technique as the basis for the
maintainability results for an edge deletion: They first delete a set of tuples whose
existence in the relation depends on the deleted edge; this step may delete more
than necessary. Then they correct the wrong deletions by doing the relational
natural join of the result of the first step with the modified graph up to two times.
This generalises the technique used in [?] for the maintenance of the transitive
closure of acyclic digraphs.

Although all the maintenance results in the paper are stated for the set 
semantics, they can be easily extended to the bag semantics (multiset). 

Related Works: [?] gives a fully dynamic algorithm which requires O(n^2)
sequential time for an edge insertion and/or an edge cost decrease, and O(mn +
n^2 log n) time for an edge deletion and/or an edge cost increase, where n is the
number of vertices and m is the number of edges in the graph. The algorithm in
[?] is for graphs with nice topologies such as trees and outerplanar graphs. [?]
achieves logarithmic query and update times for planar digraphs through graph
decomposition. [?] also consider planar graphs. These previous dynamic
algorithms and batch algorithms for the APSD problem use elaborate data
structures and need a recursive mechanism. Since most commercial database systems
are relational database systems, which do not directly support recursion, a more
powerful host language is needed for those algorithms to maintain the APSD
Given a recursive query, the non-recursive incremental evaluation approach
uses non-recursive programs to compute the difference of the answers to the
query against successive databases between updates. The mechanism used in
this approach is called a “First-Order Incremental Evaluation System” in [?]
and “Dyn-FO” in [?]. For undirected graphs, the reachability, connectivity,
bipartiteness and minimum spanning forest problems have first-order incremental
algorithms for edge insertion and edge deletion [?]. [?] considers maintaining
views defined by constrained transitive closure queries in directed graphs with
weights, but for insertion only.

Comparing the algorithms of IE in with ours for the transitive closure 
problem in undirected graphs, their algorithms are based on the maintenance 
of spanning forests of the given undirected graph while ours are not. Our algo- 
rithms are structurally simple and do not need to maintain the order of edges, a 
successor relation on all vertices. The algorithms based on maintaining spanning 
forests are hard to convert to solve the APSD problem since the set of all edges 
on the shortest paths does not necessarily have a “tree-like” structure and,
therefore, a single modification to the undirected graph may cause the reconstruction
of $O(|V|)$ spanning trees (where V is the set of vertices of the graph).
Our algorithms also have a unifying scheme of computation, for transitive
closure in undirected graphs, in acyclic digraphs 0, and for the APSD > 0 (or
APSD ≥ 0) problem in undirected graphs.

Our algorithms are in FO(+, <), using “+” and “<” to add and select the
minimum length of paths. The storage demands of our algorithms stay within
the same order of magnitude as the algorithm of jOj . 

As Spira and Pan HU show, for directed graphs, the batch all-pairs shortest- 
distance problem can, in some sense, be reduced to the problem of updating 
the solution to the all-pairs shortest-distance problem after an edge deletion. 
m says an incremental algorithm that saves only shortest-distance information 
cannot, in the worst case, do any better than a batch algorithm. Our results 
show the statement is not true for the APSD > 0 problem in undirected graphs. 

Organisation: We define some notations in Section 2. In Section 3 we give
our incremental algorithm for the distance problem in undirected graphs with
each edge having a positive distance. Section 4 extends the maintenance result of
Section 3 to the undirected graph with each edge having a non-negative distance.
Section IE shows that the APSD > 0 problem can be maintained in the same way 
as that of APSD > 0 under certain restrictions and gives a reduction between 
APSD > 0 and APSD > 0. Section 0 presents some concluding remarks and 
additional applications of our results. 



2 Preliminaries 

We assume the reader is familiar with the first-order logic (or the relational 
calculus) . In this section we present some relevant standard terminology of graph 
theory.

An undirected graph G = (V, E) consists of a finite set of vertices V and
a set of edges E such that $E \subseteq \{(u,v) \mid u,v \in V\}$, where (u,v) is viewed as an
unordered pair. Suppose each edge e of G has a distance, which is denoted as
dist(e). For example, dist(e) may represent the “weight” of edge e or the time
spent to travel between City a and City b where e = (a, b).

In this paper, all graphs mentioned are undirected graphs. We will denote
“edge e = (a, b) is in G” simply by G(a, b) (or G(e)).

An edge e can be inserted to (or deleted from) a graph G. We use $G_{+e}$ and
$G + e$ to denote $G \cup \{e\}$ and use $G_{-e}$ and $G - e$ to denote $G - \{e\}$.

A sequence $u_0u_1\ldots u_n$ ($n \geq 0$) of vertices in G is a walk (between $u_0$ and $u_n$)
if $(u_{i-1}, u_i)$ is in G for each $i \in [1..n]$; the sequence is a path if it is a walk and
$u_i \neq u_j$ for all $0 \leq i < j \leq n$ such that $i \neq 0$ or $j \neq n$. We say $u_iu_{i+1}\ldots u_{j-1}u_j$
is a subpath of $u_0u_1\ldots u_n$ for all $0 \leq i < j \leq n$; when $i \neq 0$ or $j \neq n$, we call the
former path a real subpath of the latter path. The distance of a path (or a walk)
p (denoted as dist(p)) is defined as the sum of dist(e) of all edges e on the path
(or walk). The length of a path (or a walk) p (denoted as length(p)) is defined as
the number of edges on the path (or walk). Two vertices x and y of G are said to
be reachable if there exists a path between x and y in G.

The shortest-distance (shortest-length, respectively) path (or walk) between
vertex x and vertex y is a path (or walk) between x and y with minimum distance 
(length, respectively). If there is no path between two vertices then the distance 
(length) is defined to be infinity. A shortest path is a path which has the shortest 
distance. 

We sometimes need to emphasise an ordering of the vertices on a path. We
say a path (or walk) p (between vertex u and v) is from vertex u to vertex v to
imply that p is in the order of u...v, and we say a path (or walk) p from u to v
goes through edge e = (a, b) in the order of ab to imply that p has the form of
p = u...ab...v.

The all-pairs shortest-distance (APSD) problem (the distance problem for
short) is to find the shortest distance between every pair of vertices in a graph.
A restricted form of the APSD problem is the APSD > 0 (APSD ≥ 0, respectively)
problem which restricts each edge in the graph to a positive (non-negative,
respectively) distance. We will discuss the APSD > 0 (APSD ≥ 0, respectively)
problem when the underlying graph is undirected and, as such, any edge insertion
and distance-modification for the undirected graph preserve the property of
each edge having a positive (non-negative, respectively) distance.

The all-pairs shortest-path (APSP) problem (the path problem for short) is 
to find the shortest path between every pair of vertices in the graph. 

We will usually maintain relations $SP_G$ or $SPD_G$ for each undirected graph G.
$SP_G$ is defined as $\{(x,y,d) \mid$ a shortest path between x and y in G has distance
$d\}$. $SP_G(x,y,d)$ is true if and only if the shortest path between x and y in
G has distance d. We assume $SP_G(u,u,0)$ holds for each vertex u of G and
$SP_G(x,y,\infty)$ ($\infty$ can be expressed by a null value in the database) holds if there
is no path between vertex x and vertex y. Similarly to $SP_G$, we define $SPD_G$
to be $\{(x,y,l,d) \mid$ the shortest paths between x and y in G have distance d and
the least number of edges on all such shortest paths is $l\}$. $SPD_G(x,y,l,d)$ is
true if and only if (i) the shortest paths between x and y in G have distance d
and (ii) one shortest path between x and y in G passes l edges and
all other shortest paths between x and y in G pass no fewer than l
edges. When there is no path between vertex x and vertex y in G, we assume
$SPD_G(x,y,\infty,\infty)$ holds.

In an undirected graph G with non-negative distance edges, for each vertex u
of G, we define $SPD_G(u,u,0,0)$. Clearly, $SPD_G(x,y,0,d)$ implies $(d = 0) \land (x = y)$.
However, $SPD_G(x,y,l,0)$ does not necessarily mean l = 0 or x = y.

For each binary relation R, we let $\overline{R} = R \cup \{(u,u) \mid u$ is a vertex in $R\}$.

Clearly, in an undirected graph G, if $SP_G(u,v,d)$ ($SPD_G(u,v,l,d)$, respectively)
holds then $SP_G(v,u,d)$ ($SPD_G(v,u,l,d)$, respectively) holds. The space
upper bound needed for storing these relations is $O(|V|^2)$ where V is the set of
vertices of G.

Given a set of numbers $\{l_1, l_2, \ldots, l_k\}$, we use $\min\{l_1, l_2, \ldots, l_k\}$ to denote the
minimum element among the numbers. Since the minimum operator (min) is
applied to a bounded set at a time in our algorithms, it can be replaced by
expressions using less-than (<).
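
As a worked instance of this replacement, assuming a set of exactly three numbers (the case used by the insertion formulas later), min unfolds as follows, where $d \leq l$ abbreviates $(d < l) \lor (d = l)$:

```latex
d = \min\{l_1, l_2, l_3\}
\;\Longleftrightarrow\;
(d = l_1 \lor d = l_2 \lor d = l_3)
\land (d \leq l_1) \land (d \leq l_2) \land (d \leq l_3)
```

The same unfolding applies to any fixed number of arguments, which is why a bounded use of min stays within FO(+, <).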



3 Incremental Algorithm for APSD > 0 

In this section, for an undirected graph G with each edge having a positive 
distance, we give our incremental maintenance algorithms to find the shortest 
distance between every pair of vertices after single edge insertion and single 
edge deletion. The technique for edge insertion is fairly simple and its similar 
forms have been used previously. The technique for edge deletion is much more 
involved than that for edge insertion. During the process of edge deletion, we
use one property for undirected graphs (Lemma 2): any shortest path of $G_{-e}$
can be regenerated through the join of two shortest paths, which do not pass
through edge e, with one edge in $G_{-e}$. It is because of this property that our
algorithms belong to FO(+, <). The algorithms in this section can be used to
incrementally maintain the transitive closure of undirected graphs (by setting
$SP_G$ to be $\{(x,y,l) \mid$ there exists a path between x and y in G and l is the
minimum number of edges on such paths$\}$).

The algorithms in this section use the following lemma. Intuitively, it implies 
that a shortest path can only be built from (other) shortest paths. 

Lemma 1. Given an undirected graph G,

1. if path $p = uw_1\ldots w_kv$ ($u = w_0$, $v = w_{k+1}$) is a shortest path between u and
v then any real subpath $q = w_i\ldots w_j$ of p is the shortest path between $w_i$ and
$w_j$, where $0 \leq i < j \leq k + 1$.

2. if dist(g) > 0 for each edge $g \in G$, a walk between two vertices can never be
shortest if it is not a path.

Proof. 1. Otherwise, suppose subpath $q_1 = w_i\ldots w_j$ is not a shortest path between
$w_i$ and $w_j$. Let a shortest path between $w_i$ and $w_j$ be $q_2 = w_iv_1\ldots v_hw_j$.
Then $dist(q_2) < dist(q_1)$ and, for walk $q = uw_1..w_iv_1..v_hw_j..w_kv$, $dist(q) < dist(p)$.
That is contradictory to p being a shortest path, and so the result
is proved.

2. Suppose walk $q = u_0u_1\ldots u_k$ is not a path. Then there exist $0 \leq i < j \leq k$ such
that $u_i = u_j$ and the distance of subwalk $u_i\ldots u_j$ of q is positive. Therefore,
$q' = u_0..u_iu_{j+1}..u_k$ must have a shorter distance than q, and q is not a
shortest path. ◇

The converse of Lemma 1(1) does not hold. For example, let G = {(a, b, 1),
(b, c, 1), (a, c, 1)}. Then $SP_G(a,b,1) \land SP_G(b,c,1) \land SP_G(a,c,1)$ holds. The path
a b c is not a shortest path in G, but the subpaths a b and b c are shortest paths.

3.1 Inserting Edge e to G 

The technique for edge insertion is simple. Similar techniques have been used in 
many recursive algorithms of |TT)l |21 0 0 and in the first-order algorithms 

of 0 El- 

Theorem 1. Suppose dist(g) > 0 for each edge g of G. When adding a new edge
e = (a, b) to G where dist(e) > 0, we can construct a formula $\psi$ of FO(+, <)
from $SP_G$ and e such that $SP_{G+e} = \psi$.

Proof. Let

$$\psi(x,y,d) = (\exists d_0,d_1,d_2,d_3,d_4)\; SP_G(x,a,d_1) \land SP_G(b,y,d_2) \land SP_G(x,b,d_3) \land SP_G(a,y,d_4) \land SP_G(x,y,d_0) \land (d = \min\{dist(e) + d_1 + d_2,\; dist(e) + d_3 + d_4,\; d_0\}).$$

Observe that min is applied to a set of three numbers, so it can be
replaced by formulas with < operators. Therefore, $\psi \in$ FO(+, <).

We need to show $SP_{G+e} = \psi$. To show that $SP_{G+e} \Rightarrow \psi$, let (x, y, d) be any
given tuple of $SP_{G+e}$. Two cases arise:

1. Edge e is not on the shortest paths between x and y. Then $SP_G(x,y,d)$ holds
and so $\psi(x,y,d)$ is true.

2. Edge e is on a shortest path between x and y. Let us assume, without loss of
generality, the shortest path has the form of x...w...ab...v...y. By Lemma 1,
since path x...w...ab...v...y is the shortest path in G + e, paths $q_1$ = x...w...a
and $q_2$ = b...v...y are shortest paths and they clearly do not pass through
edge e. Therefore, $d = dist(q_1) + dist(q_2) + dist(e)$ and $\psi(x,y,d)$ is true.

To show that $\psi \Rightarrow SP_{G+e}$, let (x, y, d) be a tuple such that $\psi(x,y,d)$ is
true. Observe that $\psi$ is selecting the smallest distance d of the shortest paths
from x to y that pass through e and the shortest paths from x to y that do
not pass through e. This d is clearly the shortest distance from x to y. Hence
$SP_{G+e}(x,y,d)$ holds. ◇

Example 1. Let

G = {(a, j, 1), (a, e, 52), (b, c, 7), (b, k, 7), (c, d, 2), (d, e, 2), (e, f, 4),
(f, g, 6), (g, h, 3), (g, l, 2), (h, i, 8), (h, l, 2), (i, j, 19), (k, l, 6)}

as shown in the first accompanying figure. The shortest paths between vertices are
given in the second figure. Consider inserting the edge e = (a, b, 2) (dist(e) = 2).
The shortest paths of G + e computed by the algorithm are shown in the third figure. For
instance, $SP_{G+e}(a,c,9)$ holds since $9 = \min\{2 + 7, 45\}$, $G_{+e}(a,b,2)$, $SP_G(b,c,7)$
and $SP_G(a,c,45)$.
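
The insertion step of Theorem 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `all_pairs` and `insert_edge` are hypothetical, the relation is stored as a dictionary from vertex pairs to distances, both endpoints of the new edge are assumed to be vertices of G already, and the edge list is our reading of the garbled listing in Example 1 (the edge (g, l, 2) in particular is a guess). The batch Floyd-Warshall routine serves only as a baseline to check against; it is not part of the incremental method.

```python
from itertools import product

INF = float("inf")

def all_pairs(edges):
    """Batch baseline (Floyd-Warshall): {(x, y): shortest distance}."""
    verts = sorted({v for a, b, _ in edges for v in (a, b)})
    sp = {(x, y): (0 if x == y else INF) for x in verts for y in verts}
    for a, b, w in edges:
        sp[a, b] = sp[b, a] = min(sp[a, b], w)
    for k, i, j in product(verts, repeat=3):  # k is the outermost loop
        sp[i, j] = min(sp[i, j], sp[i, k] + sp[k, j])
    return sp

def insert_edge(sp, a, b, w):
    """SP_{G+e} from SP_G for e = (a, b), dist(e) = w: one non-recursive pass."""
    new = {}
    for (x, y), d0 in sp.items():
        via_ab = sp[x, a] + w + sp[b, y]  # x ... a b ... y
        via_ba = sp[x, b] + w + sp[a, y]  # x ... b a ... y
        new[x, y] = min(d0, via_ab, via_ba)
    return new

# Edge list as reconstructed from Example 1 (an assumption, see above).
G = [("a", "j", 1), ("a", "e", 52), ("b", "c", 7), ("b", "k", 7),
     ("c", "d", 2), ("d", "e", 2), ("e", "f", 4), ("f", "g", 6),
     ("g", "h", 3), ("g", "l", 2), ("h", "i", 8), ("h", "l", 2),
     ("i", "j", 19), ("k", "l", 6)]

sp_g = all_pairs(G)
sp_plus = insert_edge(sp_g, "a", "b", 2)
print(sp_g["a", "c"], sp_plus["a", "c"])
```

Each new tuple is computed from a constant number of lookups in the old relation, which mirrors the bounded use of min in the formula.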




3.2 Deleting Edge e from G 

The main result in the section is for edge deletion. 

Theorem 2. For any undirected graph G satisfying dist(e) > 0 for each edge
e of G, we can construct a formula $\psi$ of FO(+, <) from $SP_G$ and e such that
$SP_{G-e} = \psi$.

To prove Theorem 2 and to construct $\psi$, we need some lemmas, including
the following key lemma.

Lemma 2. Let dist(g) > 0 for each edge g of G. Suppose there exists a shortest
path $p_1 = u\ldots ab\ldots v$ from vertex u to vertex v of G which goes through edge
e = (a, b) in the order of ab (as shown in the accompanying figure). Let $p_2 = uw_1\ldots w_kv$
($u = w_0$ and $v = w_{k+1}$) be a shortest path between u and v in G - e.

a. If a shortest path of G from u to $w_i$ ($w_i$ is on $p_2$, $1 \leq i \leq k + 1$) goes through
edge e, then it goes through edge e in the order of ab.

b. Similarly, if a shortest path of G from $w_i$ ($w_i$ is on $p_2$, $0 \leq i \leq k$) to v goes
through edge e, then it goes through edge e in the order of ab.

c. There exists a vertex $w_i$ ($1 \leq i \leq k + 1$) such that for subpaths $p_{21} = uw_1\ldots w_{i-1}$
and $p_{22} = w_i\ldots v$ of $p_2$, no shortest path of G among vertices
$u, w_1, \ldots, w_{i-1}$ (and $w_i, w_{i+1}, \ldots, v$) goes through edge e.




Proof. Since (a) and (b) are dual, we will prove (a) and (c) only.

For (a), we show that the situation of the accompanying figure cannot happen. Otherwise,
there exists a shortest path between u and $w_i$ (for some $0 < i \leq k + 1$) which
has the form of $u\ldots ba\ldots w_i$ (expressed in dashed line).

Let us denote the path, shown by the dashed line (the lower solid line, respectively),
between u and b by $path_{dash}(u,b)$ ($path_{solid}(u,b)$, respectively) and
denote the path, shown by the dashed line, between a and $w_i$ by $path_{dash}(a,w_i)$.

By the hypotheses and Lemma 1, the shortest paths between a and b have
distance dist(e) and

$$SP_G(a, b, dist(e)) \land SP_G(u, a, dist(path_{dash}(u,b)) + dist(e)) \land SP_G(u, a, dist(path_{solid}(u,b)) - dist(e)) \land SP_G(u, b, dist(path_{solid}(u,b))) \land SP_G(u, b, dist(path_{dash}(u,b))).$$

The formula means that the shortest paths between u and a have a distance
of $dist(path_{dash}(u,b)) + dist(e)$ and of $dist(path_{solid}(u,b)) - dist(e)$; the
shortest paths between u and b have a distance of $dist(path_{solid}(u,b))$ and of
$dist(path_{dash}(u,b))$. Comparing the distance components (the third attribute)
of the above formula, we have

$$dist(path_{dash}(u,b)) + dist(e) = dist(path_{solid}(u,b)) - dist(e), \quad\text{and}\quad dist(path_{solid}(u,b)) = dist(path_{dash}(u,b)).$$

Therefore dist(e) = 0, contradictory to the hypothesis that dist(f) > 0 for each
edge f of G.

For (c), we illustrate the situation with the accompanying figure. Let i be the smallest
integer among [1..k + 1] such that a shortest path between u and $w_i$ uses edge
e and every shortest path between u and $w_{i-1}$ does not use edge e. Since no
shortest path from u to u uses e and edge e is on a shortest path between u and
v, i exists.

According to the way that vertex $w_i$ is chosen, subpath $p_{21} = uw_1\ldots w_{i-1}$ of
$p_2$ is the shortest path between u and $w_{i-1}$ both in G and G - e, and there is
no shortest path of G among vertices $u, w_1, \ldots, w_{i-1}$ using edge e. By Lemma 1
and Lemma 2(a),

$$SP_G(u, w_i, d_3 + d_4 + dist(e)) \land SP_G(u, a, d_3) \land SP_G(b, w_i, d_4)$$ (1)

holds. We now prove that e does not lie on any shortest paths among vertices
$w_i, \ldots, w_k, v$ in G. Otherwise, according to Lemma 1 and Lemma 2(b),

$$SP_G(w_i, v, d_1 + d_2 + dist(e)) \land SP_G(w_i, a, d_1) \land SP_G(b, v, d_2)$$ (2)

holds. From (1) and (2), since a shortest path between two vertices is not longer
than any other paths between the same vertices, we have

$$d_3 + d_4 + dist(e) \leq d_3 + d_1, \quad\text{and}\quad d_1 + d_2 + dist(e) \leq d_4 + d_2.$$

It follows that $dist(e) \leq 0$, which is contradictory to our positive-distance edge
assumption. This implies that edge e does not lie on any shortest path among
$w_i, \ldots, w_k, v$ and therefore, no shortest path among $w_i, \ldots, w_k, v$ goes through edge
e. ◇



Example 2. We now illustrate the above lemma by considering undirected graph
$G_{+e}$ in Example 1. The shortest path between vertex j (u = j in the Lemma)
and vertex d (v = d in the Lemma) is $p_1$ = j a b c d. Path $p_2$ = j i h g f e d is
the shortest path between j and d not going through edge e. In Lemma 2(c), we
can let the chosen vertex be h ($w_i$ = h in the Lemma). Therefore, path(j, i) = j i
is the shortest path between j and i and path(h, d) = h g f e d is the shortest
path between h and d (no shortest paths among i and j and among h, g, f, e, d
go through edge (a, b)) according to Lemma 2(c).

Lemma 2(a) and Lemma 2(b) imply an “ordering” on the vertices of G.
They suggest that if one shortest path from an earlier vertex to a later vertex 
on path p 2 goes through edge e in one “direction” then every shortest path from 
any earlier vertex to any later vertex on p 2 goes through edge e in the same 
“direction” , so long as the shortest path passes through edge e. 

Given an undirected graph G satisfying dist(g) > 0 for each edge g of G
and an edge $e = (a, b) \in G$, the tuples of $SP_G$ can be classified into two kinds:
$S_1$, the set of tuples (x, y, d) such that no shortest path between x and y in G
goes through edge e, and $S_2 = SP_G - S_1$. Lemma 2 implies that each shortest
path of G - e which is not in $S_1$ can be regenerated, through at most two join
operations, by those shortest paths which do not go through edge e in G (i.e.,
in $S_1$). In order to give a full description, we introduce some formulas.

For edge $e \in G$, let $\Gamma^e(x,y,d)$ be a formula stating the fact that there is a
shortest path between x and y in G using edge e = (a, b), that is,

$$\Gamma^e(x,y,d) = (\exists d_0,d_1,d_2,d_3,d_4)\; SP_G(x,y,d) \land [SP_G(x,a,d_1) \land SP_G(b,y,d_2) \land (d = d_1 + d_2 + dist(e)) \lor SP_G(x,b,d_3) \land SP_G(a,y,d_4) \land (d = d_3 + d_4 + dist(e))].$$

$$\Omega^e(x,y,d) = SP_G(x,y,d) \land \neg\Gamma^e(x,y,d).$$

Then $\Omega^e(x,y,d)$ states that the shortest paths between x and y in G have
distance d but no such shortest path uses edge e. With the newly defined
formulas, Lemma 2(c) can be rewritten into the following form:

Lemma 2(d): Let dist(g) > 0 for each edge g of G. Suppose there exists a
shortest path $p_1 = u\ldots ab\ldots v$ from vertex u to vertex v of G which goes through
edge e = (a, b) in the order of ab (as shown in the accompanying figure). Let $p_2 = uw_1\ldots w_kv$
($u = w_0$ and $v = w_{k+1}$) be a shortest path in G - e. Then there exists a vertex
$w_i$ ($1 \leq i \leq k + 1$) such that for subpaths $p_{21} = uw_1\ldots w_{i-1}$ and $p_{22} = w_i\ldots v$ of
$p_2$, $\Omega^e(u, w_{i-1}, dist(p_{21})) \land \overline{G_{-e}}(w_{i-1}, w_i) \land \Omega^e(w_i, v, dist(p_{22}))$ holds. ◇

Let

$$\Phi^e(x,y,d) = (\exists d_1,d_2)(\exists w_1,w_2)\; \Omega^e(x,w_1,d_1) \land \overline{G_{-e}}(w_1,w_2) \land \Omega^e(w_2,y,d_2) \land (d = d_1 + dist(w_1,w_2) + d_2).$$

Lemma 3. Given an undirected graph G satisfying dist(e) > 0 for each edge
$e \in G$, $SP_{G-e}$ is a subset of $\Phi^e$.

Proof. Suppose $SP_{G-e}(x,y,d)$ holds and we prove $\Phi^e(x,y,d)$ is true.

Since $SP_{G-e}(x,y,d)$ holds, there exists a shortest path $p = xw_1\ldots w_ky$ (let
$x = w_0$ and $y = w_{k+1}$) in G - e such that $SP_{G-e}(x,y,dist(p))$. Two cases
arise: (a) In G, none of the shortest paths between x and y uses edge e. Then
$\Omega^e(x,y,d)$ is true. Since $\overline{G_{-e}}(y,y) \land \Omega^e(y,y,0)$ is true, $\Omega^e(x,y,d)$, $\overline{G_{-e}}(y,y)$
and $\Omega^e(y,y,0)$ are all true. Therefore, $\Phi^e(x,y,d)$ holds. (b) Otherwise. There
is a shortest path between x and y in G using edge e. From Lemma 2(d), there
exists $w_i$ such that

$$\Omega^e(x,w_{i-1},d_1) \land \overline{G_{-e}}(w_{i-1},w_i) \land \Omega^e(w_i,y,d_2) \land (d = d_1 + dist(w_{i-1},w_i) + d_2)$$

is true. Therefore, $\Phi^e(x,y,d)$ holds. ◇

Now, Theorem 2 can be easily proved from Lemma 3.

Proof of Theorem 2. Let $\psi(x,y,d) = \Phi^e(x,y,d) \land (\forall d')[\Phi^e(x,y,d') \rightarrow (d \leq d')]$.
Clearly, formula $\psi$ is in FO(+, <). From Lemma 3 we know $SP_{G-e} \subseteq \psi$.
On the other hand, suppose $\psi(x,y,d)$ holds. By the definition of $SP_{G-e}$, there
exists d' such that $SP_{G-e}(x,y,d')$ holds. Since $SP_{G-e} \subseteq \psi$, $\psi(x,y,d')$ holds. By
the definition of $\psi$, there is at most one d'' for tuple (x, y) such that $\psi(x,y,d'')$
holds. So d = d'. Thus $\psi \subseteq SP_{G-e}$. ◇

Example 3. For a simple illustration of the Theorem, let us consider the undirected
graph $G_{+e}$ of Example 1. Suppose we delete edge e = (a, b). $\Gamma^e(j,d,12)$
is true since the shortest path j a b c d ($SP_{G+e}(j,d,12)$ holds) between j and d
uses edge (a, b). The shortest path between j and d in $G_{+e} - (a,b)$ (i.e., G) is
path j i h g f e d ($SP_G(j,d,42)$ holds). $\Phi^e(j,d,42)$ is true since $\Omega^e(j,i,19)$,
G(i, h) and $\Omega^e(h,d,15)$ all hold and the distance of edge (i, h) is 8.
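
The deletion step can be sketched in the same style. The sketch below is an illustration under assumptions, not the authors' implementation: `delete_edge` and `all_pairs` are hypothetical names, unreachable pairs are represented by an infinite distance, every edge distance is assumed positive, and the edge list is our reconstruction of Example 1. It keeps the tuples for which no shortest path uses e (the relation written Ω above), joins two such tuples with one edge of G - e or with a reflexive pair of distance 0 (the relation Φ), and keeps the minimum.

```python
from itertools import product

INF = float("inf")

def all_pairs(edges):
    """Batch baseline (Floyd-Warshall), used here only to check the result."""
    verts = sorted({v for a, b, _ in edges for v in (a, b)})
    sp = {(x, y): (0 if x == y else INF) for x in verts for y in verts}
    for a, b, w in edges:
        sp[a, b] = sp[b, a] = min(sp[a, b], w)
    for k, i, j in product(verts, repeat=3):
        sp[i, j] = min(sp[i, j], sp[i, k] + sp[k, j])
    return sp

def delete_edge(sp, edges, a, b):
    """SP_{G-e} from SP_G and G for e = (a, b); assumes every dist > 0."""
    w = next(d for u, v, d in edges if {u, v} == {a, b})
    verts = sorted({x for x, _ in sp})
    # Omega: finite tuples of SP_G for which no shortest path uses e.
    omega = {}
    for (x, y), d in sp.items():
        uses_e = d in (sp[x, a] + w + sp[b, y], sp[x, b] + w + sp[a, y])
        if d < INF and not uses_e:
            omega[x, y] = d
    # Reflexive closure of G - e: each vertex reaches itself at distance 0.
    adj = {v: {v: 0} for v in verts}
    for u, v, d in edges:
        if {u, v} != {a, b}:
            adj[u][v] = adj[v][u] = d
    # Phi: Omega tuple, one step of G - e, Omega tuple; keep the minimum.
    new = {(x, y): INF for x in verts for y in verts}
    for (x, w1), d1 in omega.items():
        for w2, dw in adj[w1].items():
            for y in verts:
                if (w2, y) in omega:
                    new[x, y] = min(new[x, y], d1 + dw + omega[w2, y])
    return new

# Edge list as reconstructed from Example 1 (an assumption).
G = [("a", "j", 1), ("a", "e", 52), ("b", "c", 7), ("b", "k", 7),
     ("c", "d", 2), ("d", "e", 2), ("e", "f", 4), ("f", "g", 6),
     ("g", "h", 3), ("g", "l", 2), ("h", "i", 8), ("h", "l", 2),
     ("i", "j", 19), ("k", "l", 6)]
G_plus = G + [("a", "b", 2)]

sp_plus = all_pairs(G_plus)
restored = delete_edge(sp_plus, G_plus, "a", "b")
print(sp_plus["j", "d"], restored["j", "d"])
```

Note that the loops never follow a chain of edges of unbounded length: every new tuple comes from at most two stored tuples and one edge, which is the bounded-join property the proof relies on.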

Lemma 2 reveals why we have such an incremental maintenance result for
undirected graphs. It says that the distance of a shortest path in G - e is bounded
by the distances of (three) shortest paths in G. A similar result holds for acyclic 
digraphs, but not for general digraphs. 

Without the condition of dist{g) > 0 for each edge g of G, the “make-up” 
steps, in which we regenerate those new shortest paths, may not be completed 
through a bounded number of join operations. 

Example 4. Let G be a connected undirected graph with each edge’s distance
being 0. Whenever an edge is deleted, each tuple of $SP_G$ will also be deleted by
our algorithm, since the shortest path between every two vertices in the graph 
has the same distance as that of a walk passing the deleted edge. 
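
The effect described in Example 4 can be reproduced with a small sketch. The function name `omega` and the dictionary representation are our assumptions; the function keeps exactly those tuples of the distance relation that the Section 3 deletion step would retain, namely those whose distance is not matched by any walk through the deleted edge.

```python
def omega(sp, a, b, w):
    """Tuples of SP_G kept by the deletion step for e = (a, b), dist(e) = w."""
    return {
        (x, y): d
        for (x, y), d in sp.items()
        if d != sp[x, a] + w + sp[b, y] and d != sp[x, b] + w + sp[a, y]
    }

vertices = ["x", "y", "z"]

# Triangle with all edge distances 0: every shortest distance is 0.
sp_zero = {(u, v): 0 for u in vertices for v in vertices}
kept_zero = omega(sp_zero, "x", "y", 0)

# Same triangle with all edge distances 1, for comparison.
sp_one = {(u, v): (0 if u == v else 1) for u in vertices for v in vertices}
kept_one = omega(sp_one, "x", "y", 1)

print(len(kept_zero), len(kept_one))
```

With zero distances nothing survives, so there is nothing left to join, although x and y remain connected through z. With positive distances only the two tuples for the pair (x, y) are removed and the rest can be used to regenerate them.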

One interesting implication of above example is that we need to generalise 
the notion of “the distance of a path” to “the length of derivation”, for the 
incremental maintenance of undirected graphs with non-negative edge distances. 

4 Incremental Algorithm for APSD ≥ 0

At the end of Section 3 we have shown in Example 4 that the general APSD ≥ 0
problem may not be maintained with only relation $SP_G$ for edge deletion. In this
section, we extend the maintenance result of APSD > 0 to APSD ≥ 0 and show
such maintenance can be obtained through keeping $SPD_G$, the extended relation
of $SP_G$ by adding one more “attribute” to each tuple. We show that $SPD_{G+e}$
and $SPD_{G-e}$ can be maintained incrementally from G and $SPD_G$ in FO(+, <).

We note that the incremental maintenance algorithm in Section 3 can also be
used for the APSD ≥ 0 problem after non-negative edge insertions and positive
edge deletions. This can be proved similarly to the results in Section 3. Our
algorithms in this section work for all single edge updates.

The existence of 0-distance edges in an undirected graph G implies that some
walks (which are not paths) have the same distance as some paths. When each
edge of G has a positive distance, each tuple of $SP_G$ corresponds to paths of G.
When G has 0-distance edges, a tuple of $SP_G$ may correspond to walks of G.
Thus, after deleting a 0-distance edge, each 0-distance walk which goes through
the deleted edge is deleted if we use the algorithm for edge deletion in Section 3.
It is this fact that makes the algorithm for edge deletion of Section 3 incorrect for
0-distance-edge deletions. To solve this problem, we will maintain $SPD_G$ instead
of $SP_G$.

The following lemma means that each tuple (x, y, l, d) of $SPD_G$ corresponds
to a path and each subpath of the path corresponds to a tuple of $SPD_G$.

Lemma 4. Suppose $dist(g) \geq 0$ for each edge g in an undirected graph G and
$p = u_0u_1\ldots u_k$ is a walk of G.

1. If $SPD_G(u_0, u_k, k, dist(p))$ holds then p is a path.

2. If $SPD_G(u_0, u_k, k, dist(p))$ holds then for any subpath $p' = u_iu_{i+1}\ldots u_j$
($0 \leq i < j \leq k$) of path p, $SPD_G(u_i, u_j, j - i, dist(p'))$ holds.

Proof. 1. Suppose walk $p = u_0u_1\ldots u_k$ is not a path. Then there exist
$0 \leq i < j \leq k$ such that $u_i = u_j$ and the subwalk $u_i\ldots u_j$ of p passes at least
one edge. Therefore, $q = u_0..u_iu_{j+1}..u_k$ passes fewer edges than p does and
$dist(q) \leq dist(p)$. It implies that $SPD_G(u_0, u_k, k, dist(p))$ cannot be true.

2. Otherwise, there exists a subpath $q_1 = u_i\ldots u_j$ such that
$SPD_G(u_i, u_j, j - i, dist(q_1))$ is not true. Let $q_2 = u_iv_1\ldots v_hu_j$ be a path between $u_i$ and $u_j$ such
that $SPD_G(u_i, u_j, h + 1, dist(q_2))$ holds. Since $\neg SPD_G(u_i, u_j, j - i, dist(q_1))$
holds, $(dist(q_2) < dist(q_1)) \lor (h + 1 < j - i)$ holds. Let us consider the walk
$q = u_0u_1..u_iv_1..v_hu_j..u_k$. Clearly, $(dist(q) < dist(p)) \lor (i + h + k - j + 1 < k)$
holds. This is contradictory to the fact that $SPD_G(u_0, u_k, k, dist(p))$ holds.
Therefore, the result is proved. ◇

We note that the converse of Lemma 4(1) does not hold. For example, let
G be the undirected graph of {(a, b, 1), (b, c, 0), (c, d, 0), (b, d, 0), (d, e, 1)}. Then
q = a b c d e is a path (dist(q) = 2). But $\neg SPD_G(a,e,4,2) \land SPD_G(a,e,3,2)$
holds.
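
This counterexample can be checked mechanically. The sketch below computes the relation with (distance, edge count) pairs under a batch Floyd-Warshall scheme, relying on Python comparing tuples lexicographically; `all_pairs_spd` is a hypothetical name, and the pairs are stored as (d, l), the reverse of the attribute order (l, d) used in the text. It is a baseline for checking, not the incremental method of this section.

```python
from itertools import product

INF = float("inf")

def all_pairs_spd(edges):
    """{(x, y): (d, l)}: shortest distance d, then fewest edges l among
    the paths of distance d. Assumes non-negative edge distances."""
    verts = sorted({v for a, b, _ in edges for v in (a, b)})
    spd = {(x, y): ((0, 0) if x == y else (INF, INF))
           for x in verts for y in verts}
    for a, b, w in edges:
        spd[a, b] = spd[b, a] = min(spd[a, b], (w, 1))
    for k, i, j in product(verts, repeat=3):  # k is the outermost loop
        (d1, l1), (d2, l2) = spd[i, k], spd[k, j]
        spd[i, j] = min(spd[i, j], (d1 + d2, l1 + l2))  # lexicographic min
    return spd

G = [("a", "b", 1), ("b", "c", 0), ("c", "d", 0), ("b", "d", 0), ("d", "e", 1)]
spd = all_pairs_spd(G)
print(spd["a", "e"])
```

The path a b c d e has distance 2 and four edges, but the stored tuple for (a, e) records three edges, since a b d e reaches the same distance with one edge fewer.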

4.1 Inserting Edge e to G 

Theorem 3. Suppose $dist(g) \geq 0$ for each edge g of G. When adding a new edge
e = (a, b) to G where $dist(e) \geq 0$, we can construct a formula $\varphi$ of FO(+, <)
from $SPD_G$, G and e such that $SPD_{G+e} = \varphi$.

Proof. Let $\varphi$ be defined through the following formulas:

$$SP_{G+e}(x,y,d) = (\exists d_0,d_1,d_2,d_3,d_4)(\exists l_0,l_1,l_2,l_3,l_4)\; SPD_G(x,y,l_0,d_0) \land SPD_G(x,a,l_1,d_1) \land SPD_G(b,y,l_2,d_2) \land SPD_G(x,b,l_3,d_3) \land SPD_G(a,y,l_4,d_4) \land (d = \min\{dist(e) + d_1 + d_2,\; dist(e) + d_3 + d_4,\; d_0\}).$$

$$\Delta^e(x,y,l,d) = (\exists d_1,d_2,d_3,d_4)(\exists l_1,l_2,l_3,l_4)\; SP_{G+e}(x,y,d) \land [SPD_G(x,y,l,d) \lor SPD_G(x,a,l_1,d_1) \land SPD_G(b,y,l_2,d_2) \land (d = dist(e) + d_1 + d_2) \land (l = l_1 + l_2 + 1) \lor SPD_G(x,b,l_3,d_3) \land SPD_G(a,y,l_4,d_4) \land (d = dist(e) + d_3 + d_4) \land (l = l_3 + l_4 + 1)].$$

$$\varphi(x,y,l,d) = \Delta^e(x,y,l,d) \land (\forall l')[\Delta^e(x,y,l',d) \rightarrow (l \leq l')].$$

Clearly, $\varphi \in$ FO(+, <) since min can be replaced with less-than (<).

We need to prove $SPD_{G+e} = \varphi$. In G + e, let q be a shortest path between
x and y passing through l edges such that $SPD_{G+e}(x,y,l,d)$ holds
where d = dist(q). Two cases arise:






1. Edge e is not on path q. Then $SP_{G+e}(x,y,d) \land SPD_G(x,y,l,d)$ holds. From
the way that $\varphi$ is constructed, $SP_{G+e}(x,y,d) \land \Delta^e(x,y,l,d)$ holds.

2. Edge e is on path q. Let us assume, without loss of generality, that path q
has the form of x...w...ab...v...y. Since $SPD_{G+e}(x,y,l,d)$ holds for path q, by
Lemma 4, for subpaths $q_1$ = x...w...a and $q_2$ = b...v...y, $SPD_G(x,a,l_1,d_1) \land SPD_G(b,y,l_2,d_2)$
holds where $d_i = dist(q_i)$ and $l_i$ is the number of edges
on path $q_i$ (i = 1, 2). Therefore, $d = d_1 + d_2 + dist(e)$, $l = l_1 + l_2 + 1$ and
$SPD_{G+e} \subseteq \Delta^e$ holds. Given a tuple (x, y, l, d) of $SPD_{G+e}$, we claim that
for each tuple (x, y, l', d) of $\Delta^e$, $l \leq l'$ holds (otherwise, there is a tuple
$(x, y, l_w, d)$ of $\Delta^e$ such that $l_w < l$. Then there is a walk $q_w$ in G + e such that
$dist(q_w) = d$ and $length(q_w) = l_w$. This is contradictory to the definition of
$SPD_{G+e}$). Therefore, $SPD_{G+e} \subseteq \varphi$ from the definition of $\varphi$.

On the other hand, suppose $\varphi(x,y,l,d)$ holds. By the definition of $SPD_{G+e}$,
there exist d' and l' such that $SPD_{G+e}(x,y,l',d')$ holds. Since $SPD_{G+e} \subseteq \varphi$,
$\varphi(x,y,l',d')$ holds. By the definition of $\varphi$, there is at most one d'' and one
l'' such that $\varphi(x,y,l'',d'')$ holds. So d = d' and l = l'. Thus $\varphi \subseteq SPD_{G+e}$.
◇
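
Read operationally, the construction first fixes the new distance and then the smallest edge count among the candidates that achieve it, which amounts to a lexicographic minimum over three candidates. A minimal Python sketch under assumptions follows: `insert_edge_spd` and `all_pairs_spd` are hypothetical names, tuples are stored as (d, l) rather than in the attribute order (l, d) of the text, both endpoints of the new edge are assumed to be existing vertices, and the batch routine is included only as a baseline for checking.

```python
from itertools import product

INF = float("inf")

def all_pairs_spd(edges):
    """Batch baseline: {(x, y): (d, l)} with lexicographic minimisation."""
    verts = sorted({v for a, b, _ in edges for v in (a, b)})
    spd = {(x, y): ((0, 0) if x == y else (INF, INF))
           for x in verts for y in verts}
    for a, b, w in edges:
        spd[a, b] = spd[b, a] = min(spd[a, b], (w, 1))
    for k, i, j in product(verts, repeat=3):
        (d1, l1), (d2, l2) = spd[i, k], spd[k, j]
        spd[i, j] = min(spd[i, j], (d1 + d2, l1 + l2))
    return spd

def insert_edge_spd(spd, a, b, w):
    """SPD_{G+e} from SPD_G for e = (a, b) with dist(e) = w >= 0."""
    new = {}
    for (x, y), old in spd.items():
        (d1, l1), (d2, l2) = spd[x, a], spd[b, y]  # x ... a b ... y
        (d3, l3), (d4, l4) = spd[x, b], spd[a, y]  # x ... b a ... y
        new[x, y] = min(old,
                        (d1 + w + d2, l1 + l2 + 1),
                        (d3 + w + d4, l3 + l4 + 1))
    return new

G = [("a", "b", 1), ("b", "c", 0), ("c", "d", 0), ("b", "d", 0), ("d", "e", 1)]
spd_g = all_pairs_spd(G)
spd_plus = insert_edge_spd(spd_g, "a", "e", 2)   # distance unchanged, fewer edges
spd_zero = insert_edge_spd(spd_g, "a", "c", 0)   # inserting a 0-distance edge
print(spd_g["a", "e"], spd_plus["a", "e"])
```

In the first insertion the distance between a and e stays at 2 while the edge count drops from 3 to 1, which is exactly the information the plain distance relation cannot record.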



4.2 Deleting Edge e from G 

The maintenance result for edge deletion is as follows.

Theorem 4. For any undirected graph G satisfying $dist(e) \geq 0$ for each edge e
of G, we can construct a formula $\varphi$ of FO(+, <) from $SPD_G$ and e such that
$SPD_{G-e} = \varphi$.

Similarly to the APSD > 0 problem, we need the following lemmas to prove
the maintenance result.

Lemma 5. Suppose $dist(g) \geq 0$ for each edge g of G. In G, let $p_1$ be a shortest
path between vertex u and vertex v which passes through the minimum number
of edges of all such shortest paths (i.e., $SPD_G(u, v, l, dist(p_1))$ holds for some l).
We assume, without loss of generality, path $p_1$ goes through edge e = (a, b) in the
order of u...ab...v as shown in the accompanying figure. Let $p_2 = uw_1\ldots w_kv$ be a shortest
path between u and v ($u = w_0$, $v = w_{k+1}$) in G - e and
$SPD_{G-e}(u, v, k + 1, dist(p_2))$ holds.

a. If a shortest path $q_1$ from u to $w_i$ (i = 1, ..., k + 1) goes through edge e and
passes through $l_1$ edges such that $SPD_G(u, w_i, l_1, dist(q_1))$ holds,
then path $q_1$ from u to $w_i$ goes through edge e in the order of ab.

b. If a shortest path $q_2$ from $w_i$ (i = 0, ..., k) to v goes through edge e and
passes through $l_2$ edges such that $SPD_G(w_i, v, l_2, dist(q_2))$ holds,
then path $q_2$ from $w_i$ to v goes through edge e in the order of ab.

c. There exists a vertex $w_i$ ($1 \leq i \leq k + 1$) such that for the subpaths
$p_{21} = uw_1\ldots w_{i-1}$ and $p_{22} = w_i\ldots v$ of $p_2$,

$$SPD_G(u, w_{i-1}, i - 1, dist(p_{21})) \land SPD_G(w_i, v, k + 1 - i, dist(p_{22}))$$

holds and no shortest path among $u, w_1, \ldots, w_{i-1}$ ($w_i, \ldots, w_k, v$, respectively)
goes through edge e and passes h edges in G satisfying $h \leq i - 1$
($h \leq k + 1 - i$, respectively).

Proof. Since (a) and (b) are dual, we will prove (a) and (c) only.

For (a), we show that the situation of the accompanying figure cannot happen. Otherwise,
there exists a shortest path q' (expressed in dashed line) between u and $w_i$ for
some $0 < i \leq k + 1$ such that $SPD_G(u, w_i, k_1 + k_2 + 3, dist(q'))$.

Let us denote the path, shown by the dashed line (solid line, respectively),
between vertex u and vertex b by $path_{dash}(u,b)$ ($path_{solid}(u,b)$, respectively) and
denote the path, shown by the dashed line, between a and $w_i$ by $path_{dash}(a,w_i)$.
By the hypotheses and Lemma 4, the shortest paths from a to b in G have
distance dist(e) and

$$SPD_G(a, b, 1, dist(e)) \land SPD_G(u, a, k_1 + 2, dist(path_{dash}(u,b)) + dist(e)) \land SPD_G(u, a, h_3, dist(path_{solid}(u,b)) - dist(e)) \land SPD_G(u, b, h_3 + 1, dist(path_{solid}(u,b))) \land SPD_G(u, b, k_1 + 1, dist(path_{dash}(u,b)))$$

holds. The formula means that the minimum number of edges on shortest paths
between u and a is $k_1 + 2$ and $h_3$; the minimum number of edges on shortest
paths between u and b is $k_1 + 1$ and $h_3 + 1$. Comparing the third attribute of
each component in the formula, we have an unsatisfiable condition
$(h_3 + 1 = k_1 + 1) \land (k_1 + 2 = h_3)$ being true. Therefore, we proved (a).

For (c), we illustrate the situation with the accompanying figure. Let i be the smallest
integer from [1..k + 1] such that (i) there is a shortest path p' from u to $w_i$ going through
edge e such that $SPD_G(u, w_i, length(p'), dist(p'))$ holds and (ii) any path $p_{uw_{i-1}}$
from u to $w_{i-1}$ satisfying $SPD_G(u, w_{i-1}, length(p_{uw_{i-1}}), dist(p_{uw_{i-1}}))$ does not
go through edge e. Since $SPD_G(u, u, 0, 0) \land SPD_G(u, v, l, dist(p_1))$ holds, i exists.

Let us consider subpath $p_{21} = uw_1\ldots w_{i-1}$ of $p_2$. Since
$SPD_{G-e}(u, v, k + 1, dist(p_2))$ holds, from Lemma 4, $SPD_{G-e}(u, w_{i-1}, i - 1, dist(p_{21}))$ holds. From
the way that vertex $w_i$ is chosen and the definition of $SPD_G$,
$SPD_G(u, w_{i-1}, i - 1, dist(p_{21}))$ holds. Therefore, subpath $p_{21}$ satisfies (c).

Let $q_3$ ($q_4$, respectively) be the subpath from u to a (from b to $w_i$, respectively)
of p'. Denote $dist(q_h)$ ($length(q_h)$, respectively) by $d_h$ ($l_h$, respectively)
for h = 3, 4. According to Lemma 5(a),

$$SPD_G(u, a, l_3, d_3) \land SPD_G(b, w_i, l_4, d_4) \land SPD_G(u, w_i, l_3 + l_4 + 1, d_3 + d_4 + dist(e))$$ (3)

holds. We now prove that there is no path p'' of G using edge e among vertices
$w_i, \ldots, w_k, v$ in G such that $SPD_G(w_i, v, length(p''), dist(p''))$ holds. Otherwise,
let $q_1$ ($q_2$, respectively) be the subpath from $w_i$ to a (from b to v, respectively) of
p''. Denote $dist(q_h)$ ($length(q_h)$, respectively) by $d_h$ ($l_h$, respectively) for h = 1, 2.
By Lemma 4 and Lemma 5(b),

$$SPD_G(w_i, v, l_1 + l_2 + 1, d_1 + d_2 + dist(e)) \land SPD_G(w_i, a, l_1, d_1) \land SPD_G(b, v, l_2, d_2)$$ (4)

holds. From (3) and (4) and since a shortest path between two vertices is not
longer than any path between the same vertices, we have

$$d_3 + d_4 + dist(e) \leq d_3 + d_1, \quad\text{and}\quad d_1 + d_2 + dist(e) \leq d_4 + d_2.$$

The formulas imply dist(e) = 0 and $d_1 = d_4$. Consequently, from the hypotheses
and Lemma 4,

$$SPD_G(a, w_i, l_4 + 1, d_4) \land SPD_G(w_i, a, l_1, d_4) \land SPD_G(b, w_i, l_4, d_4) \land SPD_G(w_i, b, l_1 + 1, d_4)$$

holds. From the above formula and the definition of $SPD_G$, we have

$$l_4 + 1 = l_1, \quad\text{and}\quad l_4 = l_1 + 1.$$

The above formulas imply 1 = 0. This is a contradiction. Thus we proved the
result.

For edge e = (a, b) of G, let $\Gamma^e(x,y,l,d)$ be a formula expressing those tuples
of $SPD_G$ using edge e, that is,

$$\Gamma^e(x,y,l,d) = (\exists d_1,d_2,d_3,d_4)(\exists l_1,l_2,l_3,l_4)\; SPD_G(x,y,l,d) \land [SPD_G(x,a,l_1,d_1) \land SPD_G(b,y,l_2,d_2) \land (l = l_1 + l_2 + 1) \land (d = d_1 + d_2 + dist(e)) \lor SPD_G(x,b,l_3,d_3) \land SPD_G(a,y,l_4,d_4) \land (l = l_3 + l_4 + 1) \land (d = d_3 + d_4 + dist(e))].$$

$$\Omega^e(x,y,l,d) = SPD_G(x,y,l,d) \land \neg\Gamma^e(x,y,l,d).$$

Thus $\Omega^e(x,y,l,d)$ states that the shortest paths between x and y which have
distance d and length l do not use edge e. With the newly defined formulas,
Lemma 5(c) can be rewritten into the following form:

Lemma 5(d): Suppose $dist(g) \geq 0$ for each edge $g$ of $G$. In $G$, let $p_1$ be a
shortest path between vertex $u$ and vertex $v$ which passes through the minimum
number of edges of all such shortest paths (i.e., $SPD_G(u, v, l, dist(p_1))$ holds
for some $l$). We assume, without loss of generality, path $p_1$ goes through edge
$e = (a, b)$ in the order of $u \ldots ab \ldots v$ as shown in Figure fOl fiL. Let
$p_2 = u w_1 \ldots w_k v$ be a shortest path between $u$ and $v$ ($u = w_0$,
$v = w_{k+1}$) in $G - e$ and $SPD_{G-e}(u, v, k+1, dist(p_2))$ holds. Then there
exists a vertex $w_i$ of $p_2$ (with $i \geq 1$) such that for subpaths
$p_{21} = u w_1 \ldots w_{i-1}$ and $p_{22} = w_i \ldots v$ of $p_2$,

$$\Omega^e(u, w_{i-1}, length(p_{21}), dist(p_{21})) \wedge G{-}e(w_{i-1}, w_i) \wedge \Omega^e(w_i, v, length(p_{22}), dist(p_{22}))$$

holds. $\diamond$

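Let $\Phi^e(x,y,l,d)$ be

$$(\exists d_1, d_2)(\exists l_1, l_2)(\exists w_1, w_2)\; \Omega^e(x, w_1, l_1, d_1) \wedge G{-}e(w_1, w_2) \wedge \Omega^e(w_2, y, l_2, d_2)$$
$$\wedge\; [\,(l = l_1 + l_2 + 1) \wedge (w_1 \neq w_2) \;\vee\; (l = l_1 + l_2) \wedge (w_1 = w_2)\,] \wedge (d = d_1 + dist(w_1, w_2) + d_2).$$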



The following is the main lemma of the section.

Lemma 6. Given an undirected graph $G$ satisfying $dist(e) \geq 0$ for each edge
$e \in G$, $SPD_{G-e} \subseteq \Phi^e$.

Proof. Suppose $SPD_{G-e}(x,y,l,d)$ holds. We prove $\Phi^e(x,y,l,d)$ is true.

Since $SPD_{G-e}(x,y,l,d)$ holds, there exists a shortest path
$p = x w_1 \ldots w_{l-1} y$ (let $x = w_0$ and $y = w_l$) in $G - e$ such that
$SPD_{G-e}(x,y,l,dist(p))$. Two cases arise: (a) None of the shortest paths
between $x$ and $y$ use edge $e$ in $G$. Then $\Omega^e(x,y,l,d)$ is true. Since
$G{-}e(y,y) \wedge \Omega^e(y,y,0,0)$ is true, $\Omega^e(x,y,l,d)$, $G{-}e(y,y)$ and
$\Omega^e(y,y,0,0)$ are all true. Therefore, $\Phi^e(x,y,l,d)$ holds. (b) Otherwise.
Then there is a shortest path between $x$ and $y$ in $G$ using edge $e$. From
Lemma 5(d), there exists $w_i$ such that

$$(\exists d_1, d_2)(\exists l_1, l_2)\; \Omega^e(x, w_{i-1}, l_1, d_1) \wedge G{-}e(w_{i-1}, w_i) \wedge \Omega^e(w_i, y, l_2, d_2)$$
$$\wedge\; (l = l_1 + l_2 + 1) \wedge (d = d_1 + dist(w_{i-1}, w_i) + d_2)$$

is true. Therefore, $\Phi^e(x,y,l,d)$ holds. $\diamond$

Proof of Theorem □ Let 

$$MP^e(x,y,l,d) = \Phi^e(x,y,l,d) \wedge (\forall d')(\forall l')\,[\Phi^e(x,y,l',d') \rightarrow (d \leq d')], \text{ and}$$

$$\varphi(x,y,l,d) = MP^e(x,y,l,d) \wedge (\forall l')\,[MP^e(x,y,l',d) \rightarrow (l \leq l')].$$

Clearly, $\varphi$ is in $FO(+,<)$. We need to prove $SPD_{G-e} = \varphi$.

1. Suppose $SPD_{G-e}(x,y,l,d)$ holds. From Lemma 6 we know
$SPD_{G-e} \subseteq \Phi^e$. By the definition of $SPD_{G-e}$,
$SPD_{G-e} \subseteq MP^e$ holds and $SPD_{G-e} \subseteq \varphi$ holds.

2. Suppose $\varphi(x,y,l,d)$ holds. By the definition of $SPD_{G-e}$, there exist
$d'$ and $l'$ such that $SPD_{G-e}(x,y,l',d')$ holds. Since
$SPD_{G-e} \subseteq \varphi$, $\varphi(x,y,l',d')$ holds. Then there is at most one
$d''$ and one $l''$ such that $\varphi(x,y,l'',d'')$ holds from the definition of
$\varphi$. So, $d = d'$ and $l = l'$. Thus $\varphi \subseteq SPD_{G-e}$.

<> 



5 A Limited Reduction between $SP_G$ and $SPD_G$

The algorithm for APSD $> 0$ in Section 0 does not need to maintain any
auxiliary information during the evaluation, while the algorithm for APSD $\geq 0$
in Section 0 does need an auxiliary relation. The algorithm for the APSD $\geq 0$
problem can be considered as using the algorithm for APSD $> 0$ twice, once for
distance and once for length, in a nested way.

In many practical applications, any modification on the edges of an undirected
graph $G$ must have some physical limitations, e.g., the amount of information
passed through a cable is bounded by the physical property of the cable and the
length of cables used is finitely bounded. To model such situations, we assume
that, for each edge $g$ of $G = (V, E)$, $dist(g)$ is a non-negative integer and $n$
is the upper bound of $|V|$. Under this assumption, the APSD $\geq 0$ problem can be
solved by using the method for APSD $> 0$ through redefining $dist(g)$ with
$DIST(g)$ for each edge $g$ of $G$, where $DIST(g) = dist(g) \times n + 1$. In this
way, for a given undirected graph $G$, we actually construct a new graph $G'$, which
is the same as $G$ except that distance is defined by "DIST" instead of "dist". The
incremental maintenance of the APSD $\geq 0$ problem for graph $G$ with relation
$SPD_G$ is transformed to the maintenance of the APSD $> 0$ problem for graph
$G'$ with relation $SP_{G'}$.

The correctness of this transformation is based on the following two facts:
(i) For each tuple $(x,y,l,d)$ of $SPD_G$ there is only one tuple $(x,y,d_L)$ of
$SP_{G'}$ where $d_L = d \times n + l$ and, for each tuple $(x,y,d_L)$ of $SP_{G'}$
there is only one tuple $(x,y,d,l)$ of $SPD_G$ where $d = (d_L \text{ div } n)$ and
$l = (d_L \bmod n)$; (ii) For any two paths $p_1$ and $p_2$ between vertex $x$ and
vertex $y$, if $dist(p_1) < dist(p_2)$ then $DIST(p_1) < DIST(p_2)$ holds and if
$DIST(p_1) < DIST(p_2)$ then $dist(p_1) \leq dist(p_2)$ holds. Since
$DIST(p_1) = dist(p_1) \times n + l_1$ and $DIST(p_2) = dist(p_2) \times n + l_2$,
$DIST(p_2) - DIST(p_1) = (dist(p_2) - dist(p_1)) \times n + (l_2 - l_1)$. Therefore,
if $dist(p_2) > dist(p_1)$ holds, then $DIST(p_2) - DIST(p_1) \geq n - (n - 1) > 0$
holds. Similarly, we could prove that $dist(p_1) \leq dist(p_2)$ holds if
$DIST(p_1) < DIST(p_2)$.
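
The encoding behind fact (i) can be illustrated with a small sketch. It is not the incremental FO(+,<) maintenance procedure of this paper: it uses an ordinary Dijkstra search, chosen here only to show that a single shortest-distance computation over the rescaled weights yields both the distance and the minimum number of edges. The function names and the four-vertex graph are hypothetical.

```python
import heapq

def scaled_weight(dist_g, n):
    """DIST(g) = dist(g) * n + 1 for an edge g with non-negative integer dist(g)."""
    return dist_g * n + 1

def decode(d_big, n):
    """Recover (d, l) from d_L = d * n + l; valid because l < n."""
    return divmod(d_big, n)

def shortest_scaled(source, edges, n):
    """Plain Dijkstra over the rescaled weights, which are all strictly positive."""
    adj = {}
    for u, v, w in edges:
        big = scaled_weight(w, n)
        adj.setdefault(u, []).append((v, big))
        adj.setdefault(v, []).append((u, big))
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        for v, big in adj.get(u, []):
            if d + big < best.get(v, float("inf")):
                best[v] = d + big
                heapq.heappush(heap, (d + big, v))
    return best

# Hypothetical undirected graph with zero-distance edges; n bounds |V|.
edges = [("a", "b", 0), ("b", "c", 0), ("a", "c", 0), ("c", "d", 2)]
n = 4
best = shortest_scaled("a", edges, n)
pairs = {v: decode(d_big, n) for v, d_big in best.items()}
```

Here `pairs["d"]` is the pair (distance, length) for the vertex `d`: the quotient gives the original distance and the remainder gives the smallest number of edges among the shortest paths.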

Let $G = (V, E)$ be an undirected graph with restrictions such that $dist(g)$ is
an integer for each edge $g$ of $G$ and $n$ is the upper bound of $|V|$. The key point
of the algorithm is that $l < n$ holds for each tuple $(x,y,l,d)$ of $SPD_G$.



6 Concluding Remarks 

We have given $FO(+,<)$ algorithms for the incremental maintenance of the
APSD $> 0$ problem and the APSD $\geq 0$ problem. Our algorithms have a low
parallel complexity and make additional optimizations possible (e.g., the
optimization techniques in relational databases). These extend earlier results on the
maintenance of transitive closure of undirected graphs and the results in jSj. 
For example, our algorithms for the APSD $\geq 0$ problem can be used to answer
queries such as "list the minimum number of cities on a shortest path".

Our results can be applied to the following problems: (i) Although all the
maintenance results in the paper are written for set semantics, they can be
easily extended to bag (multiset) semantics by storing the number of alternative
shortest paths of a tuple along with the tuple itself. For example, considering
the APSD $> 0$ problem, we define $SP_G$ to be $\{(x,y,d,c) \mid d$ is the distance
of the shortest path between $x$ and $y$, $c$ is the number of alternative shortest
paths between $x$ and $y\}$. $SP_G(x,y,d,c)$ is true if and only if there exist $c$
alternative shortest paths between $x$ and $y$ in $G$ which have distance $d$. The
maintenance algorithms are similar to those for $SP_G$. (ii) Our algorithms can be
used to maintain the shortest paths themselves. Refer to unj for more details. 

Acknowledgements: We would like to thank Kevin Glynn for his helpful comments.
We are also grateful to the anonymous referees for their very helpful

Abstract. In this paper we present a new approach for studying aggregations
in the context of database query languages. Starting from a broad
definition of aggregate function, we address our investigation from two 
different perspectives. We first propose a declarative notion of uniform 
aggregate function that refers to a family of scalar functions uniformly 
constructed over a vocabulary of basic operators by a bounded Turing 
Machine. This notion yields an effective tool to study the effect of the 
embedding of a class of built-in aggregate functions in a query language. 

All the aggregate functions most used in practice are included in this 
classification. We then present an operational notion of aggregate func- 
tion, by considering a high-order folding constructor, based on structural 
recursion, devoted to compute numeric aggregations over complex val- 
ues. We show that numeric folding over a given vocabulary is sometimes 
not able to compute, by itself, the whole class of uniform aggregate functions
over the same vocabulary. It turns out, however, that this limitation
can be partially remedied by the restructuring capabilities of a query 
language. 

1 Introduction 

Computing aggregations has always been considered an important feature of
practical database query languages. This ability is indeed fundamental in specific 
application domains whose relevance has recently increased; among them, on- 
line analytical processing (OLAP), decision support, statistical evaluation, and 
management of geographical data. In spite of this fact, a systematic study of 
aggregations in the context of query languages has evolved quite slowly. Apart 
from the work by Klug H3I, who formalized extensions of algebra and calculus 
with aggregates, in the last decade there have been few papers dealing with this 
subject even if specific aggregate functions of theoretical relevance, 

like counters, have received special attention mn\- 

The general approach to the problem is to study basic properties of formal
languages equipped with a specific class of built-in aggregate functions
(typically the ones provided by SQL, that is, MIN, MAX, SUM, COUNT and AVG, plus
further functions of theoretical interest, like EVEN). By contrast, we would like
to attack the problem from a more general perspective, possibly independent of
the specific aggregate functions chosen. To do that, we first need to answer a
fundamental question: what precisely is an aggregate function? It is folk
knowledge that a database aggregate function takes a collection of objects (with
or without duplicates) as argument, and returns a new value that "summarizes"
a numeric property of the collection. This broad definition is clearly too vague
to provide clues for answering general questions about query languages with
aggregate capabilities.

* This work was partially supported by CNR and by MURST.

Our first goal is then trying to refine this definition, in order to provide a solid 
basis to the whole picture. Borrowing some ideas from the circuit model P2CEI, 
we start by noting that an aggregate function $g$ over a collection $s$ can be
effectively described by a family of scalar functions (that is, traditional $k$-ary
functions) $G = \{g_0, g_1, g_2, \ldots\}$ such that, for each $k \geq 0$: (i) $g_k$
computes $g$ when $s$ contains exactly $k$ elements; (ii) the result of $g_k$ is
invariant under any permutation of its arguments (as $g$ operates on collections
rather than sequences); and (iii) $g_k$ is constructed by using a fixed vocabulary
of basic operations and
constants. To guarantee tractability of this representation in terms of families 
of scalar functions, we need to set some constraint on the way in which the 
various gk are built. We make this constraint precise by introducing a notion of 
uniform construction, stating that a circuit description of each function gk in 
G needs to be “easily” generated from k (specifically, from a Turing machine 
having bounded complexity). In this way, the family G forms a good represen- 
tative of the aggregate function g, since the computational power is captured by 
the gk’s themselves rather than by their construction. This approach leads to 
the definition, in a very natural way, of different and appealing abstract classes 
of aggregate functions: a class is composed of all the aggregate functions that 
can be represented by a uniform family of functions over a given vocabulary. 
With the assumption that each scalar function in the vocabulary has constant 
computational cost, uniform construction indeed guarantees tractability of the 
aggregate functions so defined. 

We are now ready to address the impact of incorporating aggregate func- 
tions in a database query language. To this end, we consider a two-sorted al- 
gebra for complex values P^, called CVA, over an uninterpreted domain and an 
interpreted one. In this context, we first study extensions of CVA with built-in 
aggregate functions, that is, with operators whose semantics is defined outside 
the database. We then present a simple constructor, called folding, that allows 
the user to define and apply an aggregation within CVA. This operator is based 
on structural recursion jS] and is defined in terms of a pre function p, an in- 
crement function i over the variables acc and curr, and a post function q. The 
application of a folding expression over a collection of complex values s returns 
a new value by iterating over the elements of s in a natural way: starting from 
acc = p{s), for each element curr of s (chosen in some arbitrary order) the func- 
tion i is applied to acc and curr; the result is obtained by applying q to the value 
of acc at the end of the iteration. A constraint on i guarantees that the result
of folding over a collection s is indeed independent of the order in which the
elements of s are selected during the evaluation. We then restrict our attention 
to a simplified version of folding devoted to compute numeric aggregations: it 
operates only on multisets of interpreted atomic values and returns a numeric 
value, without involving complex data types in its evaluation. 
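
The iteration described above can be sketched as follows. This is only an illustration under stated assumptions: `fold` and `average` are hypothetical names, the accumulator is a Python pair (so this corresponds to general folding, not to the restricted numeric version), and order independence is checked by brute force on one input instead of being enforced by a constraint on the increment function.

```python
from fractions import Fraction
from itertools import permutations

def fold(pre, inc, post, s):
    """Start from acc = pre(s), apply inc(acc, curr) per element, return post(acc)."""
    acc = pre(s)
    for curr in s:
        acc = inc(acc, curr)
    return post(acc)

def average(s):
    """AVG as a folding: the accumulator carries the running (sum, count)."""
    return fold(
        pre=lambda _s: (Fraction(0), 0),
        inc=lambda acc, curr: (acc[0] + curr, acc[1] + 1),
        post=lambda acc: acc[0] / acc[1] if acc[1] else Fraction(0),
        s=s,
    )

values = [Fraction(1), Fraction(2), Fraction(2), Fraction(3)]
# Evaluate over every enumeration of the multiset; a well-formed folding
# must give the same answer for all of them.
results = {average(list(p)) for p in permutations(values)}
```

Because the increment only adds to a sum and a counter, every enumeration of the multiset produces the same result, so `results` contains a single value.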

We finally relate the abstract notion of uniform aggregate function with this 
procedural way of computing aggregations. We first show that, even if numeric 
folding is less expressive than general folding, they have the same expressive 
power when embedded in CVA; that is, the restructuring capabilities of the
algebra overcome the gap in expressiveness. We then show that numeric folding 
over a given set of scalar functions computes only uniform aggregate functions 
over the same set of scalar functions, but not all of them. We demonstrate that 
this limitation can be partially overcome by the restructuring capabilities of 
a query language. It turns out however that CVA, extended with the folding 
operator and a vocabulary of scalar functions, is still not able to capture the 
whole class of uniform aggregate functions over the given vocabulary. 

The rest of the paper is devoted to a formalization of the issues discussed
in this section and is organized as follows. In Section 2 we introduce the basic
notions and present a number of examples of uniform aggregate functions. In
Section 3 we introduce the CVA query language and investigate extensions of
CVA with built-in aggregate functions. The folding operator is introduced in
Section 4, where we also characterize an important restriction of it. In Section 5
we relate the expressive power of the various extensions of CVA with aggregations.
Finally, in Section 6 we state a number of interesting open problems.



2 Aggregate Functions 

2.1 Basic Definitions 

Let us start with a very general definition of aggregate function. Let
$\{|\mathcal{N}|\}$ denote the class of finite multisets of values from a countably
infinite domain $\mathcal{N}$. (Multisets generalize sets, in that they allow an
element to occur multiple times.)

Definition 1. An aggregate function over $\mathcal{N}$ is a total function from
$\{|\mathcal{N}|\}$ to $\mathcal{N}$, mapping each multiset of values to a value.

Examples of aggregate functions over numeric domains are sum ($\sum$), product
($\prod$), counting, average, maximum, and minimum. Maximum and minimum are
also examples of aggregate functions over collections of non-numeric domains 
like strings. Note that we require aggregate functions to be total; in some cases, 
this requires some attention. For instance, the sum of a multiset of numbers 
is always well-defined; conversely, in defining the maximum function MAX, an 
arbitrary choice for its result over the empty multiset is required. 

In the rest of the paper, we will only consider aggregate functions over the
domain $\mathbb{Q}$ of the rational numbers.

An important aspect of our approach is that we clearly separate the restructuring
capabilities of a query language from its ability to compute aggregations.
Specifically, we make a distinction between aggregate functions (that is, aggrega- 
tions over numbers) and aggregate queries (that is, queries involving aggregate 
functions). Under this interpretation, the counting of an arbitrary set (say, of 
a set of strings) is an aggregate query, which can be computed by first map- 
ping each element to some numeric constant (say, 1), and then applying an 
aggregate function that counts the elements of the resulting numeric multiset. 
Similarly, testing whether two sets are equinumerous can be viewed as an ag- 
gregate query accomplished by computing the cardinalities of the sets with an 
aggregate function, and then checking for their equality. Finally, we note that 
the maximum function with SQL semantics is an aggregate query that returns 
a singleton set over non-empty sets and an empty set over empty sets; again, 
this can be implemented by using a maximum aggregate function together with 
some restructuring operations. 



2.2 Uniform Aggregate Functions

In order to provide a concrete basis for the investigation of aggregations in the 
context of database queries, we now refine the definition of aggregate function. 
We first introduce a number of preliminary notions, to develop the following 
ideas, inspired from the circuit model nasi: (i) an aggregate function can 
be represented by a family of scalar functions; (ii) each scalar function can be 
described by an arithmetic circuit over a collection of base functions;¹ (iii)
uniformity in the construction of a family of circuits guarantees tractability of the
represented aggregate function. 

A scalar function (over $\mathbb{Q}$) is a total function from $\mathbb{Q}^k$ to
$\mathbb{Q}$, with $k \geq 0$. (Examples of scalar functions are the nullary
functions 0 and 1 and the binary functions $+$, $*$, $-$, and $/$.) An enumeration
of a multiset $s = \{|v_1, \ldots, v_n|\}$ is a tuple $\vec{s} = [v_1, \ldots, v_n]$
containing the same elements as $s$ with the same multiplicities, in any of the
possible orderings. A family of functions is a set $G$ of scalar functions such
that, for each $k \geq 0$, there is one function $g_k : \mathbb{Q}^k \to \mathbb{Q}$
in $G$, called the $k$-th component of $G$.

We say that a family $G$ of functions represents an aggregate function
$h : \{|\mathbb{Q}|\} \to \mathbb{Q}$ if, for each $k \geq 0$, each multiset $s$ of
cardinality $k$, and each enumeration $\vec{s}$ of $s$, it is the case that
$g_k(\vec{s}) = h(s)$, where $g_k$ is the $k$-th component of $G$. Note that the
above definition implies that only symmetric functions can be used in the
representation of aggregate functions. We recall that a $k$-ary function $f$ is
symmetric if it remains invariant under any permutation of its arguments, that is,
$f(x_1, \ldots, x_k) = f(x_{\sigma(1)}, \ldots, x_{\sigma(k)})$ for every permutation
$\sigma$ on $\{1, \ldots, k\}$.
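
The symmetry requirement can be made concrete with a brute-force check over all enumerations of a small multiset. The sketch below is illustrative only: `sum_component`, `first_component` and `same_on_all_enumerations` are hypothetical names, and testing a single multiset does not prove symmetry in general.

```python
from fractions import Fraction
from itertools import permutations

def sum_component(k):
    """k-th component of a family for SUM: adds its k arguments."""
    def g_k(*xs):
        if len(xs) != k:
            raise ValueError("g_k expects exactly k arguments")
        total = Fraction(0)
        for x in xs:
            total += x
        return total
    return g_k

def first_component(k):
    """A k-ary function that is not symmetric: it returns its first argument."""
    def f_k(*xs):
        return xs[0] if xs else Fraction(0)
    return f_k

def same_on_all_enumerations(f, multiset):
    """True if f yields one value over every enumeration of the multiset."""
    return len({f(*p) for p in permutations(multiset)}) == 1

s = [Fraction(1), Fraction(1), Fraction(5, 2)]
sum_ok = same_on_all_enumerations(sum_component(3), s)
first_ok = same_on_all_enumerations(first_component(3), s)
```

Only the first family can represent an aggregate function: the projection on the first argument gives different results on different enumerations of the same multiset.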

Let a vocabulary be a set of scalar functions over $\mathbb{Q}$. A circuit over a
vocabulary is a labeled directed acyclic graph whose nodes are either function
nodes or input nodes. Each function node is labeled by a function in the
vocabulary; a node labeled by an $m$-ary function has precisely $m$ ingoing arcs
(the arcs are ordered). An input node is a node with no ingoing arcs labeled by
$x_i$, with $i > 0$. There is one distinguished node with no outgoing arcs, called
the output of the circuit. A $k$-ary circuit is a circuit with at most $k$ input
nodes having different labels from $x_1, \ldots, x_k$. The semantics of a $k$-ary
circuit is the $k$-ary scalar function computed by the circuit in the natural way.
A circuit describes a scalar function $f$ if its semantics coincides with $f$. An
encoding of a circuit is a listing of its nodes, with the respective labels. The
size of the circuit is the number of its nodes. A description of a function $f$,
together with its size, is a circuit that describes $f$. Examples of circuits are
reported in Figure 1.

¹ Actually, there are other ways to describe scalar functions, such as straight line
programs and formulae, which are essentially equivalent to arithmetic circuits.

To guarantee the tractability of a family of functions, we introduce a notion
of uniformity. A family $G$ of functions is uniform if a description of the $n$-th
component of $G$ can be generated by a deterministic Turing machine using
$O(\log n)$ workspace on input $1^n$.

Definition 2. A uniform aggregate function over a vocabulary $\Omega$ is an
aggregate function that can be represented by a uniform family of functions over
$\Omega$.

As a particular but important case, we call uniform counting functions the uni- 
form aggregate functions that can be represented by families of constant func- 
tions. We note that, since a constant function does not depend on its arguments, 
a uniform counting function can be described by a family of circuits without in- 
put nodes. 

It is worth noting that we use the term “uniform” to mean logspace uniform, 
as it is customary in the context of the circuit model, where several different 
notions of uniformity are essentially equivalent to logspace uniformity. Note also 
that, as a consequence of the requirement of logspace computability in the defi- 
nition, the size of the description of the n-th component of any uniform family 
is polynomial in n. 

As a first example, Figure 1 shows the first three components of a family
describing the average aggregate function AVG over the vocabulary
$\Omega_{\mathbb{Q}}$, where $\Omega_{\mathbb{Q}} = \{0, 1, +, -, *, /\}$.

We note that, for each $k > 0$, the left and right subtrees of the output of
$AVG_k$ correspond to the $k$-th component of the aggregate functions SUM and
COUNT, respectively. Actually, these three aggregate functions can be defined
inductively as follows:

$$SUM_0 = 0 \qquad SUM_k = SUM_{k-1} + x_k \quad \text{for } k > 0$$
$$COUNT_0 = 0 \qquad COUNT_k = COUNT_{k-1} + 1 \quad \text{for } k > 0$$
$$AVG_0 = 0 \qquad AVG_k = SUM_k / COUNT_k \quad \text{for } k > 0$$

Intuitively, the fact that the $k$-th component of a family can be derived from the
previous component in a simple way guarantees uniformity of the family na. 
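
The inductive definitions of SUM, COUNT and AVG translate directly into a generator of circuit descriptions. The sketch below is an assumption-laden illustration: circuits are written as nested Python tuples (so they are expression trees, not shared DAGs), the names are hypothetical, and a Python loop stands in for the logspace Turing machine of the definition.

```python
from fractions import Fraction

def sum_circuit(k):
    """SUM_0 = 0; SUM_k = SUM_{k-1} + x_k."""
    c = ("0",)
    for i in range(1, k + 1):
        c = ("+", c, ("x", i))
    return c

def count_circuit(k):
    """COUNT_0 = 0; COUNT_k = COUNT_{k-1} + 1."""
    c = ("0",)
    for _ in range(k):
        c = ("+", c, ("1",))
    return c

def avg_circuit(k):
    """AVG_0 = 0; AVG_k = SUM_k / COUNT_k for k > 0."""
    if k == 0:
        return ("0",)
    return ("/", sum_circuit(k), count_circuit(k))

def evaluate(c, xs):
    """Semantics of a circuit on the enumeration xs = [x_1, ..., x_k]."""
    op = c[0]
    if op == "0":
        return Fraction(0)
    if op == "1":
        return Fraction(1)
    if op == "x":
        return Fraction(xs[c[1] - 1])
    left, right = evaluate(c[1], xs), evaluate(c[2], xs)
    if op == "+":
        return left + right
    if op == "/":
        return left / right
    raise ValueError(op)

def size(c):
    """Number of nodes, counting the description as a tree."""
    if c[0] in ("0", "1", "x"):
        return 1
    return 1 + size(c[1]) + size(c[2])

avg3 = evaluate(avg_circuit(3), [1, 2, 6])
```

Each step from $k-1$ to $k$ adds a constant number of nodes, so the size of the description grows linearly in $k$.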

Fig. 1. A circuit representation of the aggregate function AVG

The vocabulary $\Omega_{\mathbb{Q}}$ can be used to define several other aggregate
functions. For example, the function EVEN can be defined as $EVEN_0 = 1$ and
$EVEN_k = 1 - EVEN_{k-1}$, for each $k > 0$, in which 0 is interpreted as false and
1 as true. A different family for the same function is such that $EVEN_{2k} = 1$
and $EVEN_{2k+1} = 0$, for each $k \geq 0$. The aggregate function EXP, such that
$EXP_k = 2^k$, can be defined as $EXP_0 = 1$ and $EXP_k = double(EXP_{k-1})$ for
$k > 0$, where $double(x) = x + x$. Note that the obvious description
$1 + 1 + \ldots + 1$ for $EXP_k$ (with $2^k$ number of 1's and $2^k - 1$ number of
+'s) cannot be generated by a logspace computation, since it requires a number of
operators which is exponential in $k$. Further uniform aggregate functions over
$\Omega_{\mathbb{Q}}$ are the following: any fixed natural or rational constant $c$;
the function FIB that generates Fibonacci's numbers; the function FATT, such that
$FATT_k = k!$; the function PROD, such that $PROD_k = x_1 * \ldots * x_k$.



Functions like COUNT, EVEN, and EXP do not depend on the actual values of
the elements in the argument (multi)set, but rather on the number of elements 
it contains. According to our definition, these are indeed uniform counting func- 
tions. 
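
The contrast between the two descriptions of EXP can be sketched as follows. The sketch assumes that a circuit is encoded as a list of nodes in which a function node refers to earlier nodes by index, so that a shared subcircuit is stored once; `exp_dag`, `evaluate_dag` and `naive_size` are hypothetical names.

```python
def exp_dag(k):
    """EXP_0 = 1; EXP_k = double(EXP_{k-1}), with double(x) = x + x.

    Node 0 is the constant 1; node i adds node i-1 to itself.
    The circuit has no input nodes, as expected for a counting function.
    """
    nodes = [("1",)]
    for i in range(1, k + 1):
        nodes.append(("+", i - 1, i - 1))
    return nodes

def evaluate_dag(nodes):
    """Evaluate the nodes in listing order; the last node is the output."""
    values = []
    for node in nodes:
        if node[0] == "1":
            values.append(1)
        else:
            _, a, b = node
            values.append(values[a] + values[b])
    return values[-1]

def naive_size(k):
    """Nodes in 1 + 1 + ... + 1: 2^k ones plus 2^k - 1 additions."""
    return 2 ** k + (2 ** k - 1)

dag = exp_dag(10)
value = evaluate_dag(dag)
```

The shared description has $k + 1$ nodes, while the flat sum of ones has $2^{k+1} - 1$ nodes.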

Although many interesting functions can be defined in terms of
$\Omega_{\mathbb{Q}}$, this vocabulary does not allow the definition of the minimum
and maximum functions MIN and MAX. To this end, we need to use further scalar
functions to take care of order. For instance, using an ordering function
$\geq : \mathbb{Q} \times \mathbb{Q} \to \{0, 1\}$ and the functions in
$\Omega_{\mathbb{Q}}$, it is easy to define also the functions $=$, $\neq$, and $>$.
Then, it can be shown that MIN and MAX are uniform aggregate functions over
$\Omega_{\mathbb{Q}} \cup \{\geq\}$.
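
One possible way to obtain the derived comparisons and a maximum from a 0/1-valued ordering function is sketched below. The construction is an assumption made for illustration and is not taken from the paper; the names `ge`, `eq`, `neq`, `gt`, `max2` and `max_component` are hypothetical, and the value 0 on the empty multiset is an arbitrary choice.

```python
from fractions import Fraction

def ge(a, b):
    """Ordering function returning 1 if a >= b and 0 otherwise."""
    return Fraction(1) if a >= b else Fraction(0)

def eq(a, b):
    return ge(a, b) * ge(b, a)      # both directions hold

def neq(a, b):
    return 1 - eq(a, b)

def gt(a, b):
    return 1 - ge(b, a)             # a > b iff not (b >= a)

def max2(a, b):
    """Select a when a >= b and b otherwise, using only +, -, * and ge."""
    g = ge(a, b)
    return g * a + (1 - g) * b

def max_component(xs):
    """MAX_k built by repeated max2; the value on the empty multiset is arbitrary."""
    if not xs:
        return Fraction(0)
    acc = xs[0]
    for x in xs[1:]:
        acc = max2(acc, x)
    return acc

xs = [Fraction(3, 2), Fraction(-1), Fraction(7, 3), Fraction(2)]
largest = max_component(xs)
```

Each step reuses the previous result twice, so the linear-size description relies on sharing that subcircuit in the directed acyclic graph.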

2.3 A Hierarchy of Aggregate Functions 

Let $\mathcal{A}(\Omega)$ be the class of uniform aggregate functions over a
vocabulary $\Omega$. According to our definition, an element of
$\mathcal{A}(\Omega)$ can be represented by a uniform family of functions whose
description of the $n$-th component can be generated by a Turing machine using
$O(\log n)$ workspace. Actually, a Turing machine using $O(\log n)$ workspace is
essentially equivalent to a program that uses a fixed number of counters, each
ranging over $\{1, \ldots, n\}$; it is then apparent that the expressiveness of
such a machine increases with the number of available counters. To capture this
fact, we now introduce a hierarchy of classes of uniform aggregate functions, as
follows.

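Definition 3. $\mathcal{A}^k(\Omega)$ is the class of uniform aggregate functions
such that the description of the $n$-th component has size $O(n^k)$.

Clearly, $\mathcal{A}(\Omega) = \bigcup_{k \geq 0} \mathcal{A}^k(\Omega)$ and
$\mathcal{A}^k(\Omega) \subseteq \mathcal{A}^{k+1}(\Omega)$ for each $k \geq 0$.
Actually, it turns out that there are vocabularies for which the above hierarchy
is proper.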



Lemma 1. There are a vocabulary $\Omega$ and a natural $k$ such that
$\mathcal{A}^k(\Omega) \subset \mathcal{A}^{k+1}(\Omega)$.

Sketch of proof. Let $\Omega_a = \{0_\oplus, \oplus, \otimes\}$ be an "abstract"
vocabulary, where $0_\oplus$ is a constant, $\oplus$ is a commutative binary
function, and $\otimes$ is an arbitrary binary function. Consider the uniform
aggregate function $\alpha$ over $\Omega_a$ such that
$\alpha_n = \bigoplus_{i,j} x_i \otimes x_j$, where $\bigoplus$ is defined in terms
of $\oplus$ using $0_\oplus$ as "initial value". Since the $n$-th component
contains $O(n^2)$ operators, the function $\alpha$ is in $\mathcal{A}^2(\Omega_a)$
by definition. However, it turns out that $\alpha$ does not belong to
$\mathcal{A}^1(\Omega_a)$, since it is not possible to represent its components
using a linear number of operators. □

Remark 1. Let us now briefly discuss the above result. Consider the following
"concrete" interpretation $\bar{\Omega}_a$ for the abstract vocabulary $\Omega_a$:
$0_\oplus$ is 0 (the rational number zero), $\oplus$ is $+$ (addition of rational
numbers), and $\otimes$ is $*$ (multiplication of rational numbers);
multiplication is distributive over addition. According to this interpretation
$\bar{\Omega}_a$, the $n$-th component $\alpha_n = \sum_{i,j} x_i * x_j$ of
$\alpha$ can be rewritten as $(\sum_i x_i) * (\sum_j x_j)$, which has $O(n)$
operators, and therefore $\alpha$ belongs to $\mathcal{A}^1(\bar{\Omega}_a)$.
Conversely, consider the function $\beta$ such that
$\beta_n = \bigoplus_{i \neq j} x_i \otimes x_j$. It is easy to see that $\beta$
belongs to $\mathcal{A}^2(\Omega_a) - \mathcal{A}^1(\Omega_a)$. Furthermore, even
with respect to the interpretation $\bar{\Omega}_a$ for the vocabulary $\Omega_a$,
the function $\beta$ belongs to
$\mathcal{A}^2(\bar{\Omega}_a) - \mathcal{A}^1(\bar{\Omega}_a)$. It would belong to
$\mathcal{A}^1(\bar{\Omega}_a)$ if $\bar{\Omega}_a$ contained the difference
operator $-$; in this case, $\beta_n$ can be rewritten as
$(\sum_i x_i) * (\sum_j x_j) - \sum_k (x_k * x_k)$.

Thus, in general, properties of the aggregate functions in the class
$\mathcal{A}(\Omega)$ depend on properties of the scalar functions in the
vocabulary $\Omega$. □
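
The two rewritings in Remark 1 can be checked numerically under the concrete interpretation, reading $\alpha_n$ as the sum over all pairs $(i, j)$ and $\beta_n$ as the sum over pairs with $i \neq j$. The function names and the sample input below are hypothetical.

```python
from fractions import Fraction

def alpha_quadratic(xs):
    """Sum of x_i * x_j over all pairs (i, j): n^2 multiplications."""
    return sum((xi * xj for xi in xs for xj in xs), Fraction(0))

def alpha_linear(xs):
    """(sum_i x_i) * (sum_j x_j): a linear number of operators."""
    total = sum(xs, Fraction(0))
    return total * total

def beta_quadratic(xs):
    """Sum of x_i * x_j over pairs with i != j."""
    n = len(xs)
    return sum(
        (xs[i] * xs[j] for i in range(n) for j in range(n) if i != j),
        Fraction(0),
    )

def beta_linear(xs):
    """(sum_i x_i) * (sum_j x_j) - sum_k (x_k * x_k); needs the difference operator."""
    total = sum(xs, Fraction(0))
    squares = sum((x * x for x in xs), Fraction(0))
    return total * total - squares

xs = [Fraction(1), Fraction(2), Fraction(3), Fraction(1, 2)]
alpha_agrees = alpha_quadratic(xs) == alpha_linear(xs)
beta_agrees = beta_quadratic(xs) == beta_linear(xs)
```

The linear form of $\beta_n$ is available only because subtraction can cancel the diagonal terms $x_k * x_k$.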

3 A Query Language with Interpreted Functions 

In this section we investigate the embedding of a class of aggregate functions 
within a database query language. We first introduce an algebra for complex 
values then extend it with interpreted scalar and aggregate functions. 

3.1 The Data Model 

We consider a two-sorted data model for complex values, over a countably
infinite, uninterpreted domain $\mathcal{D}$ and the interpreted domain
$\mathbb{Q}$ of the rational numbers. We fix two further countably infinite
disjoint sets: a set $\mathcal{A}$ of attribute names and a set $\mathcal{R}$ of
complex-values names. The types of the model are recursively defined as follows:
(i) $\mathcal{D}$ and $\mathbb{Q}$ are atomic types; (ii) if
$\tau_1, \ldots, \tau_k$ are types and $A_1, \ldots, A_k$ are distinct attribute
names from $\mathcal{A}$, then $[A_1 : \tau_1, \ldots, A_k : \tau_k]$ is a tuple
type ($k \geq 0$); and (iii) if $\tau$ is a type, then $\{\tau\}$ is a set type.
The domain of complex values associated with a type is defined in the natural way.

A database scheme is a tuple of the form $(s_1 : \tau_1, \ldots, s_n : \tau_n)$,
where $s_1, \ldots, s_n$ are distinct complex-values names from $\mathcal{R}$ and
each $\tau_i$ is a type. A database instance is a function mapping each
complex-values name $s_i$ to a value of the corresponding type $\tau_i$, for
$1 \leq i \leq n$. Note that the $s_i$'s are not required to be sets of (complex)
values.

3.2 The Complex-Values Algebra

Our reference language, denoted by CVA, is a variant of the complex-values al- 
gebra of Abiteboul and Beeri Q without powerset. The language is based on 
operators, similar in spirit to relational algebra operators, and function con- 
structors, which are high-order constructors used to apply functions to complex 
values. We now briefly recall the main features of the language, referring the 
reader to Q for further details. 

The expressions of the language describe functions² from a database scheme
to a certain type, which are built by combining operators and function
constructors starting from a basic set of functions. More specifically, the base
functions include constants and complex-values names from $\mathcal{R}$ (viewed as
nullary functions), attribute names from $\mathcal{A}$, the identity function
$id$, the set constructor $\{\}$, the binary predicates $=$, $\in$, and
$\subseteq$, the boolean connectives $\wedge$, $\vee$, and $\neg$. The operators
include the set operations $\cup$, $\cap$, and $-$, the cross product (which is a
variant of the $k$-ary Cartesian product), and the set-collapse (which transforms
a set of sets into the union of the sets it contains).

The function constructors allow the definition of further functions, as follows: 

- the binary composition constructor $f \circ g$, where $f$ and $g$ are functions,
defines a new function whose meaning is "apply $g$, then $f$";

- the labeled tuple constructor $[A_1 = f_1, \ldots, A_n = f_n]$, where
$f_1, \ldots, f_n$ are unary functions and $A_1, \ldots, A_n$ are distinct
attribute names, defines a new function over any type as follows:
$[A_1 = f_1, \ldots, A_n = f_n](s) = [A_1 : f_1(s), \ldots, A_n : f_n(s)]$;

- the replace constructor $replace\langle f \rangle$, where $f$ is a unary
function, defines a new function over a set type as:
$replace\langle f \rangle(s) = \{f(w) \mid w \in s\}$;

- the selection constructor $select\langle c \rangle$, where $c$ is a unary
boolean function, defines a new function over a set type as:
$select\langle c \rangle(s) = \{w \mid w \in s \text{ and } c(w) \text{ is true}\}$.

² We mainly refer to a general notion of function rather than query because, in order
to include aggregations, it is convenient to relax the assumption that the result of a
CVA expression is a set.

To increase readability, we will often use an intuitive, simplified notation. For
example, if $A$ and $B$ are attributes and $f$ is a function, we write $A.B$ instead
of $B \circ A$, $A(f)$ instead of $A \circ f$, and $A \in B$ instead of
$\in \circ\, [A, B]$. For the labeled tuple constructor, we write $[A]$ instead of
$[A = A]$, and $[A.B]$ instead of $[B = A.B]$ (that is, an attribute takes its name
from the name of the last navigated

As an example, let $s_1$ and $s_2$ be complex-values names, having types
$\{[A : \mathcal{D}, B : \mathcal{D}]\}$ and
$\{[E : \{\mathcal{D}\}, F : \mathcal{D}]\}$, respectively. Then the following
expression computes the projection on $A$ and $F$ of the join of $s_1$ with $s_2$
based on the condition $B \in E$:

$$replace\langle [X.A, Y.F] \rangle (select\langle X.B \in Y.E \rangle (cross_{[X,Y]}(s_1, s_2))).$$
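
The join expression above can be mimicked with ordinary Python sets. The sketch is a loose analogue and not an implementation of CVA: tuple values are modelled as frozensets of (attribute, value) pairs so that they can be nested inside sets, and the helper names `tup`, `attr`, `replace`, `select` and `cross`, as well as the sample data, are hypothetical.

```python
from itertools import product

def tup(**fields):
    """A labeled tuple value [A1: v1, ..., An: vn], hashable so it can sit in a set."""
    return frozenset(fields.items())

def attr(t, name):
    """Attribute access on a labeled tuple."""
    return dict(t)[name]

def replace(f):
    """replace<f>(s) = { f(w) | w in s }."""
    return lambda s: frozenset(f(w) for w in s)

def select(c):
    """select<c>(s) = { w | w in s and c(w) is true }."""
    return lambda s: frozenset(w for w in s if c(w))

def cross(names, *sets):
    """Cross product labeling the i-th component with names[i]."""
    return frozenset(tup(**dict(zip(names, combo))) for combo in product(*sets))

s1 = frozenset({tup(A="a1", B="p"), tup(A="a2", B="q")})
s2 = frozenset({tup(E=frozenset({"p", "r"}), F="u"), tup(E=frozenset({"q"}), F="v")})

pairs = cross(["X", "Y"], s1, s2)
joined = select(lambda w: attr(attr(w, "X"), "B") in attr(attr(w, "Y"), "E"))(pairs)
result = replace(
    lambda w: tup(A=attr(attr(w, "X"), "A"), F=attr(attr(w, "Y"), "F"))
)(joined)
```

The nesting order follows the expression: the cross product is built first, the selection filters it on $B \in E$, and the replace constructor projects on $A$ and $F$.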

3.3 Adding Interpreted Functions 

Interpreted functions (that is, functions whose semantics is defined outside the 
database) can be included in CVA in a very natural way |H. 

Let us first consider a set $\Omega$ of interpreted scalar functions over
$\mathbb{Q}$. These functions can be embedded into CVA by simply extending the set
of available base functions with those in $\Omega$, and allowing their application
through the function constructors. For instance, if $+$ is in $\Omega$, we can
extend the tuples of a complex value $r$ of type
$\{[A : \mathbb{Q}, B : \mathbb{Q}]\}$ with a further attribute $C$ holding the sum
of the values in $A$ and $B$ by means of the expression
$replace\langle [A, B, C = A + B] \rangle(r)$.

Let us now consider a set $\Gamma$ of aggregate functions over $\mathbb{Q}$. In
this case we cannot simply extend the base functions with those in $\Gamma$, since
we could obtain incorrect results. For example, if we need to sum the $A$
components of the tuples in the complex value $r$ above, we cannot use the
expression $SUM(replace\langle A \rangle(r))$, since this would eliminate
duplicates before the aggregation. Therefore, we introduce an aggregate application
constructor $g\{|f|\}$, where $g$ is an aggregate function in $\Gamma$ and $f$ is
any CVA function. For a set $s$, $g\{|f|\}(s)$ specifies that $g$ has to be applied
to the multiset $\{|f(w) \mid w \in s|\}$ rather than to the set
$\{f(w) \mid w \in s\}$. Thus, the above function can be computed by means of the
expression $SUM\{|A|\}(r)$.
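
The difference between aggregating over a set and over a multiset can be seen on a three-tuple example. In this sketch the tuples of $r$ are plain Python pairs $(A, B)$, a list stands in for the multiset, and `aggregate_apply` is a hypothetical helper name.

```python
def aggregate_apply(g, f):
    """Apply g to the multiset {| f(w) | w in s |}; a list keeps the duplicates."""
    return lambda s: g([f(w) for w in s])

# A complex value r with tuples [A, B]; two tuples share the same A value.
r = {(1, 10), (1, 20), (2, 30)}

def project_a(w):
    return w[0]

# Projecting into a set first collapses the duplicate A values.
sum_after_set_projection = sum({project_a(w) for w in r})

# The aggregate application keeps one occurrence of A per tuple of r.
sum_over_multiset = aggregate_apply(sum, project_a)(r)
```

The set-based projection loses one occurrence of the value 1, so the two sums differ.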

In the following, we will denote by CVA + $\Omega$ + $\Gamma$ the complex-values
algebra extended in this way with the scalar functions in $\Omega$ and the
aggregate functions in $\Gamma$.



3.4 Expressive Power and Complexity 



It is well known that the complex-values algebra CVA is equivalent to many other algebraic or calculus-based languages (without powerset) over complex values and nested collections proposed in the literature [?]. It turns out that CVA expresses only functions (over uninterpreted databases) that have PTIME data complexity.

When considering the complexity of functions over numeric interpreted domains, a cost model has to be specified, since it can be defined in several different ways. For instance, it is possible to consider a number as a bit-sequence and define the complexity of a function over a tuple of numbers with respect to the length of the representations of the numbers. However, we prefer to treat numbers as atomic entities, and assume that basic scalar functions are computed in a single step, independently of the magnitude or complexity of the involved numbers [?]. Specifically, we assume that the computational cost of any scalar function that we consider is unitary; this is indeed a reasonable choice when arithmetic operations and comparisons over the natural or rational numbers are considered. Under this assumption, the data complexity of CVA + $\Omega$ is in PTIME. Furthermore, if we consider a class $\Gamma$ of aggregate functions having polynomial time complexity in the size of their input then, by a result in [?], CVA + $\Omega$ + $\Gamma$ remains in PTIME.

By noting that, for a collection $\Omega$ of scalar functions, the complexity of evaluating aggregate functions in $\mathcal{A}(\Omega)$ is polynomial in the size of their input, the following result easily follows.

Theorem 1. Let $\Omega$ be a collection of scalar functions. Then CVA + $\Omega$ + $\mathcal{A}(\Omega)$ has PTIME data complexity.

4 An Operator for Defining Aggregate Functions 

In this section we investigate the introduction, in the query language CVA, of a higher-order constructor, called folding, that allows us to define and apply an aggregation. This operator is essentially based on structural recursion [?].

4.1 Folding Expressions 

A folding signature is a triple $(\tau_i, \tau_a, \tau_o)$ of types, with the restriction that the type $\tau_a$ does not involve the set type constructor. As we will clarify later, this type restriction ensures tractability of folding.

Definition 4. A folding constructor of signature $(\tau_i, \tau_a, \tau_o)$ is an expression of the form $\mathit{fold}(p; i; q)$, where:

- p is a CVA function of type $\{\tau_i\} \to \tau_a$, called pre-function;

- i is a left-commutative(*) CVA function $\tau_i \times \tau_a \to \tau_a$ over the symbols curr and acc, called increment function; and

- q is a CVA function of type $\tau_a \to \tau_o$, called post-function.

A folding $\mathit{fold}(p; i; q)$ of signature $(\tau_i, \tau_a, \tau_o)$ defines a function of type $\{\tau_i\} \to \tau_o$; the result of applying this function over a collection s is computed iteratively as follows:

acc := p(s);
for each curr in s do acc := i(curr, acc);
return q(acc);

(*) A binary function f is left-commutative if it satisfies the condition $f(x_1, f(x_2, y)) = f(x_2, f(x_1, y))$. Commutativity implies left-commutativity, but not vice versa.
Initially, the result of applying the pre-function p to s is assigned to the “accumulator” variable acc. Then, for each element curr of s (chosen in some arbitrary order), the result of applying i to curr and acc is (re)assigned to acc (curr
stands for current element). Finally, the result of the whole expression is given 
by the application of the post-function q to the value of acc at the end of the 
iteration. It is important to note that, since the function i is left-commutative, 
the semantics of folding is well-defined, that is, the result is independent of the 
order in which the elements of s have been selected. 
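The iteration described above can be sketched in Python as a higher-order function. This is only an illustration under assumptions: the names `fold`, `count`, `total` and `skewed` are hypothetical, collections are encoded as lists, and `skewed` deliberately uses an increment function that is not left-commutative to show why the restriction matters.

```python
from itertools import permutations

def fold(p, i, q):
    """Build the function denoted by fold(p; i; q)."""
    def apply(s):
        acc = p(s)              # pre-function initialises the accumulator
        for curr in s:          # elements are taken in whatever order s yields them
            acc = i(curr, acc)  # increment function updates the accumulator
        return q(acc)           # post-function produces the result
    return apply

identity = lambda x: x
count = fold(lambda s: 0, lambda curr, acc: acc + 1, identity)
total = fold(lambda s: 0, lambda curr, acc: acc + curr, identity)

# Not left-commutative: i(x1, i(x2, y)) = 4y + 2*x2 + x1 depends on the order.
skewed = fold(lambda s: 0, lambda curr, acc: 2 * acc + curr, identity)

values = [3, 1, 2]
orders = [list(o) for o in permutations(values)]
total_results = {total(o) for o in orders}    # one result for every order
skewed_results = {skewed(o) for o in orders}  # several order-dependent results
```

With the left-commutative increment every visiting order gives the same sum, whereas the `skewed` folding returns different values for different orders, so it would not define a function on sets.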

For example, let r be a complex value of type $\{[A : \mathcal{D}, B : \mathbb{Q}]\}$. The count of the tuples in r can be computed by the expression $\mathit{fold}(0; \mathit{acc} + 1; \mathit{id})\{|\mathit{id}|\}(r)$. (Note that we often use the folding constructor together with the aggregate application constructor.) Similarly, the sum of the B components of the tuples in r is given by $\mathit{fold}(0; \mathit{acc} + B(\mathit{curr}); \mathit{id})\{|\mathit{id}|\}(r)$ or, equivalently, by $\mathit{fold}(0; \mathit{acc} + \mathit{curr}; \mathit{id})\{|B|\}(r)$. The average of the B components can be computed by evaluating the sum and the count in parallel, and then taking their ratio:

$\mathit{fold}([S = 0, C = 0]; [S = \mathit{acc}.S + \mathit{curr}.B, C = \mathit{acc}.C + 1]; S/C)\{|\mathit{id}|\}(r)$.

Folding also allows the computation of aggregations over nested sets. For instance, let s be an expression of type $\{[A : \mathcal{D}, B : \{\mathbb{Q}\}]\}$, and assume that we want to extend each tuple by a new component holding the sum of the set of numbers occurring in the B component. This can be obtained by:

$\mathit{replace}([A, B, C = \mathit{fold}(0; \mathit{acc} + \mathit{curr}; \mathit{id})\{|\mathit{id}|\}(B)])(s)$.

This expression suggests that an SQL query with the group-by clause can be 
specified in this algebra by applying fold after a nesting. 
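The average and nested-sum examples can be mimicked with a short Python sketch. The encoding is an assumption made for illustration: tuples are dictionaries, collections are lists, and the sample values `r` and `s` are hypothetical.

```python
def fold(p, i, q):
    """Build the function denoted by fold(p; i; q)."""
    def apply(s):
        acc = p(s)
        for curr in s:
            acc = i(curr, acc)
        return q(acc)
    return apply

# Average of the B components: the accumulator is a tuple [S, C] (sum and count).
r = [{"A": "x", "B": 4}, {"A": "y", "B": 6}, {"A": "z", "B": 8}]
average = fold(
    lambda s: {"S": 0, "C": 0},
    lambda curr, acc: {"S": acc["S"] + curr["B"], "C": acc["C"] + 1},
    lambda acc: acc["S"] / acc["C"],
)
mean_b = average(r)

# Nested aggregation: extend each tuple with C = sum of the set stored in B,
# which is how a group-by followed by SUM looks after a nesting step.
s = [{"A": "x", "B": frozenset({1, 2, 3})}, {"A": "y", "B": frozenset({10})}]
set_sum = fold(lambda b: 0, lambda curr, acc: acc + curr, lambda acc: acc)
extended = [{**t, "C": set_sum(t["B"])} for t in s]
```

The accumulator of `average` is a flat tuple of numbers, so it respects the restriction that the accumulator type does not involve the set constructor.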

The restriction on the type $\tau_a$ imposed in the definition of folding signature has been introduced to guarantee tractability of the folding constructor. In fact, the unrestricted folding constructor, $\mathit{fold}^{\mathit{unr}}$, in which the accumulator can be of any complex type, can express very complex data manipulations, including replace, set-collapse, and powerset:

$\mathit{powerset}(s) = \mathit{fold}^{\mathit{unr}}(\{\{\}\}; \mathit{acc} \cup \mathit{replace}(\mathit{id} \cup \{\mathit{curr}\})(\mathit{acc}); \mathit{id})\{|\mathit{id}|\}(s)$.
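The powerset expression can be traced with a small Python sketch, assuming an accumulator encoded as a frozenset of frozensets. The helper names are hypothetical. Since the accumulator is itself a set of sets, this folding falls outside the restricted signature, and its result doubles in size at every step.

```python
def fold(p, i, q):
    """Build the function denoted by fold(p; i; q), with no type restriction."""
    def apply(s):
        acc = p(s)
        for curr in s:
            acc = i(curr, acc)
        return q(acc)
    return apply

powerset = fold(
    lambda s: frozenset({frozenset()}),                          # start from {{}}
    lambda curr, acc: acc | frozenset(x | {curr} for x in acc),  # add curr to every subset so far
    lambda acc: acc,
)

result = powerset({1, 2, 3})
```

For a three-element input the accumulator grows through 1, 2, 4 and 8 subsets, which illustrates why an unrestricted accumulator type is not tractable.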



4.2 Numeric Folding 

We now consider a further constraint on the definition of folding, which essen- 
tially allows only the computation of numeric aggregations. This is coherent with 
our approach that tends to make a clear distinction between the restructuring 
capabilities of a query language and its ability to compute aggregations. 

Definition 5. Let $\Omega$ be a vocabulary. A numeric folding over $\Omega$ is a folding constructor $\varphi$ that satisfies the following conditions:

- the signature of $\varphi$ has the form $(\mathbb{Q}, \tau_a, \mathbb{Q})$, where $\tau_a$ is either $\mathbb{Q}$ or $[A_1 : \mathbb{Q}, \ldots, A_k : \mathbb{Q}]$, with $k > 0$;

- the pre-function, the increment function, and the post-function of $\varphi$ can use only the following functions and constructors: the identity function, binary composition, the labeled tuple constructor, the scalar functions in $\Omega$, the numeric folding constructor, and possibly the attribute names $A_1, \ldots, A_k$.

Definition 6. A counting folding over a vocabulary $\Omega$ is a numeric folding over $\Omega$ in which the increment function does not use the symbol curr.



Lemma 2. Let $\Omega$ be a vocabulary. Then:

- every numeric folding over $\Omega$ computes a uniform aggregate function over $\Omega$;

- every counting folding over $\Omega$ computes a uniform counting function over $\Omega$.

For instance, $\mathit{fold}(0; \mathit{acc} + \mathit{curr}; \mathit{id})$ is a numeric folding that computes the aggregate function SUM, whereas $\mathit{fold}(0; \mathit{acc} + 1; \mathit{id})$ is a counting folding that computes the counting function COUNT.



4.3 Expressive Power of Query Languages with Folding 

In this section we relate the language CVA -1-17-1- fold, in which aggregations 
are computed using the folding constructor, with CVA-\- fl -\- fold, in which only 
numeric foldings are allowed. We consider also the weaker languages CVA -\- 
fl -\- fold‘d and CVA -1-17-1- fold , in which the former disallows the use of the 
symbol curr (and thus increment functions cannot refer to the current element 
while iterating over a collection) and the latter has only the counting folding 
constructor. 

For example, the following is a CVA + $\Omega$ + fold expression, over the complex value s of type $\{\{\mathcal{D}\}\}$ (a set of sets), that computes the sum of the cardinalities of the sets in s:

$\mathit{fold}(0; \mathit{acc} + \mathit{fold}(0; \mathit{acc} + 1; \mathit{id})\{|\mathit{id}|\}(\mathit{curr}); \mathit{id})\{|\mathit{id}|\}(s)$.

In the foregoing expression, the outer folding is not numeric (since it applies to a set of sets). There is however an equivalent CVA + $\Omega$ + fold expression:

$\mathit{fold}(0; \mathit{acc} + \mathit{curr}; \mathit{id})\{|\mathit{fold}(0; \mathit{acc} + 1; \mathit{id})\{|\mathit{id}|\}|\}(s)$.
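The equivalence of the two expressions can be checked on a sample value with the following Python sketch. The helper `aggregate_application` is a hypothetical encoding of the constructor g{|f|}, with multisets represented as lists, and the sample value `s` is an assumption.

```python
def fold(p, i, q):
    """Build the function denoted by fold(p; i; q)."""
    def apply(s):
        acc = p(s)
        for curr in s:
            acc = i(curr, acc)
        return q(acc)
    return apply

def aggregate_application(g, f):
    """g{|f|}: apply g to the multiset (here a list) of f(w) for w in s."""
    return lambda s: g([f(w) for w in s])

identity = lambda x: x
count = fold(lambda s: 0, lambda curr, acc: acc + 1, identity)
total = fold(lambda s: 0, lambda curr, acc: acc + curr, identity)

s = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]

# Generic folding: the increment function itself folds over the current inner set.
generic = fold(lambda s: 0, lambda curr, acc: acc + count(curr), identity)
via_generic = aggregate_application(generic, identity)(s)

# Numeric foldings only: count each inner set first, then sum the multiset of counts.
via_numeric = aggregate_application(total, aggregate_application(count, identity))(s)
```

In the second form the restructuring (mapping each inner set to its cardinality) is done by the aggregate application, so both foldings only ever see numbers.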

This example suggests that expressions involving generic foldings can be rewrit- 
ten into expressions involving only numeric foldings, by exploiting the restruc- 
turing capabilities of CVA. This is confirmed by the following result. 

Theorem 2. Let fl be a vocabulary. Then: 

— CVA -t- 17 -b fold and CVA -\- fl -\- fold have the same expressive power; 

— CVA -\- fl -\- fold‘d and CVA -1-17-1- fold have the same expressive power. 
5 Querying with Aggregate Functions

In this section we study the relationship between the class of aggregate func- 
tions that can be expressed by the numeric folding and the class of the uniform 
aggregate functions, as well as the effect of their inclusion in CVA. 

We first show that numeric folding, considered as a stand-alone operator, is 
weaker than uniform aggregation. Specifically, the following result states that 
any aggregate function expressed by a numeric folding can be described by a 
uniform aggregate function whose description has size linear in the number of 
its arguments. 

Let A(f2) be the class of the aggregate functions computed by numeric fold- 
ings over a vocabulary J7, and be the class of counting functions defined 

by counting foldings over f2. Moreover, let C(f2) be the class of uniform counting 
functions over f2. 

Theorem 3. Let L2 he a vocabulary. Then .F(I7) C and C C^[f2). 

Actually, we conjecture that T{f2) = A^{L2) and lF'=(f7) = C^(I7), but we do not 
have a proof at the moment. 

Thus, in general, we have that C A{f2) and lF'^(f7) C C(I7) but, ac- 

cording to Lemma^ there are vocabularies for which the containment is proper, 
that is, lF(f7) C A(I7) and C C{f2). These containments suggest that nu- 

meric folding presents a limitation in computing aggregate functions, as it is not 
able to capture an entire class A(f2). On the other hand, it is possible to show 
that this deficiency can be partially remedied by the restructuring capabilities 
of a query language. For instance, consider again the uniform aggregate function 
a introduced in the proof of Lemma Q with = ©, j Xi ® Xj , which belongs to 
A‘^{f2a) — A^{f2a) and, as such, it cannot be expressed as an aggregate function 
by a numeric folding. However, it can be expressed as an aggregate query in 
CVA + f2a+ fold, as follows: 

fold{0^; ace 0 curr; id)fAi 0 A 2 \{cross[Ai,A- 2 \{'^d, id)). 



Lemma 3. Let f2 he a vocabulary. Then, for each k > 0: 

— there are functions in — A^~^ {Q) that can he expressed in CVA+ L2 + 

fold; 

— there are functions in C^{Q) —C^~^{fl) that can he expressed in CVA+ f2 + 

Actually, there are uniform aggregate functions that cannot be expressed using aggregate queries with folding, showing that, in general, it can be better to have at one's disposal a language for expressing aggregate functions (such as the uniform ones) outside the database query language.
Theorem 4. There is a vocabulary $\Omega$ such that:

- CVA + $\Omega$ + $\mathcal{A}(\Omega)$ is more expressive than CVA + $\Omega$ + fold;

— CVA + f2 + C(f2) is more expressive than CVA + f2 + fold . 

Sketch of proof. Consider again the abstract vocabulary introduced in the 
proof of Lemma ^ and assume that the binary function ® is commutative. 
Then, the family 7 of functions such that 7 ^ = Xi®Xj represents a uniform 

aggregate function, which belongs to A^{f2a)—A^(f2a)- It turns out that 7 cannot 
be expressed using CVA + + fold. □ 

6 Conclusions and Future Work 

We believe that the framework proposed in this paper can be fruitfully used as a formal foundation for further studies on the relationship between aggregate functions and aggregate queries. In particular, the relationship between the classes CVA + $\Omega$ + fold and CVA + $\Omega$ + $\mathcal{A}(\Omega)$ deserves a deeper investigation.

On one hand, we plan to investigate the implications of considering mutual properties of functions in a vocabulary, such as commutativity, associativity, and distributivity. For instance, the function $\gamma$ introduced in the proof of Theorem 4 is an aggregate function only if $\otimes$ is commutative. What are the vocabularies for which CVA + $\Omega$ + fold and CVA + $\Omega$ + $\mathcal{A}(\Omega)$ have the same expressive power?

On the other hand, there are simple logspace computations that cannot be 
easily captured by folding. The definition of the functions (3 (Remark 0 and 7 
(proof of Theorem ^ shows that uniform construction can compare indexes of 
the arguments (as i < j in 0-^'’ xi 0 Xj and f j in 0-^^ Xi 0 Xj). Such a 
capability can be partially captured by referring to total ordered domains (both 
T> and Q) and a total order predicate > (both as a base function in CVA and in 
the vocabulary J7). This extension is another topic that we plan to investigate, 
first of all by tackling the following claim: 

Conjecture 1. Let $\Omega$ be a vocabulary. Then CVA + $\Omega$ + fold and CVA + $\Omega$ + $\mathcal{A}(\Omega)$ have the same expressive power over totally ordered domains.
Abstract. In this paper, we address the problem of finding frequent itemsets in a database. Using the closed itemset lattice framework, we show that this problem can be reduced to the problem of finding frequent closed itemsets. Based on this statement, we can construct efficient data mining algorithms by limiting the search space to the closed itemset lattice rather than the subset lattice. Moreover, we show that the set of all frequent closed itemsets suffices to determine a reduced set of association rules, thus addressing another important data mining problem: limiting the number of rules produced without information loss. We propose a new algorithm, called A-Close, using a closure mechanism to find frequent closed itemsets. We conducted experiments to compare our approach to the commonly used frequent itemset search approach. These experiments showed that our approach is very valuable for dense and/or correlated data, which represent an important part of existing databases.



1 Introduction 

The discovery of association rules was first introduced in [?]. This task consists in determining relationships between sets of items in very large databases. Agrawal's statement of this problem is the following [?]. Let $\mathcal{I} = \{i_1, i_2, \ldots, i_m\}$ be a set of m items. Let the database $\mathcal{D} = \{t_1, t_2, \ldots, t_n\}$ be a set of n transactions, each one identified by its unique TID. Each transaction t consists of a set of items I from $\mathcal{I}$. If $\|I\| = k$, then I is called a k-itemset. An itemset I is contained in a transaction $t \in \mathcal{D}$ if $I \subseteq t$. The support of an itemset I is the percentage of transactions in $\mathcal{D}$ containing I. Association rules are of the form $r : I_1 \xrightarrow{c} I_2$, with $I_1, I_2 \subset \mathcal{I}$ and $I_1 \cap I_2 = \emptyset$. Each association rule r has a support defined as $\mathit{support}(r) = \mathit{support}(I_1 \cup I_2)$ and a confidence c defined as $\mathit{confidence}(r) = \mathit{support}(I_1 \cup I_2) / \mathit{support}(I_1)$. Given the user-defined minimum support minsup and minimum confidence minconf thresholds, the problem of mining association rules can be divided into two sub-problems [?]:

1. Find all frequent itemsets in $\mathcal{D}$, i.e. itemsets with support greater than or equal to minsup.



Catriel Beeri, Peter Buneman (Eds.): ICDT’99, LNCS 1540, pp. 398-^^] 1998. 
(c) Springer- Verlag Berlin Heidelberg 1998 
2. For each frequent itemset $I_1$ found, generate all association rules $I_2 \xrightarrow{c} I_1 - I_2$ where $I_2 \subset I_1$, with confidence c greater than or equal to minconf.

Once all frequent itemsets and their support are known, the association rule 
generation is straightforward. Hence, the problem of mining association rules is 
reduced to the problem of determining frequent itemsets and their support. 
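The definitions of support and confidence can be made concrete with a brief Python sketch. The five-transaction database `DB` is a hypothetical example (not taken from the source), and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical database: TID -> set of items.
DB = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return Fraction(sum(1 for t in DB.values() if items <= t), len(DB))

def confidence(antecedent, consequent):
    """confidence(I1 -> I2) = support(I1 u I2) / support(I1)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both) / support(antecedent)

support_bce = support("BCE")            # transactions 2, 3 and 5
conf_bc_e = confidence("BC", "E")       # every transaction with B and C also has E
conf_c_b = confidence("C", "B")         # 3 of the 4 transactions with C also have B
```

Once the supports of all frequent itemsets are known, every confidence is a ratio of two stored supports, which is why the second sub-problem needs no further database pass.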

Recent works demonstrated that frequent itemset discovery is also the key stage in the search for episodes from sequences and in finding keys or inclusion as well as functional dependencies from a relation [?]. All existing algorithms use one of the two following approaches: a levelwise [?] bottom-up search [?] or a simultaneous bottom-up and top-down search [?]. Although they are dissimilar, all those algorithms explore the subset lattice (itemset lattice) for
finding frequent itemsets: they all use the basic properties that all subsets of a 
frequent itemset are frequent and that all supersets of an infrequent itemset are 
infrequent in order to prune elements of the itemset lattice. 

In this paper, we propose a new efficient algorithm, called A-Close, for find- 
ing frequent closed itemsets and their support in a database. Using a closure 
mechanism based on the Galois connection, we define the closed itemset lattice 
which is a sub-order of the itemset lattice, thus often much smaller. This lattice is closely related to the Galois lattice [?], also called concept lattice [?].
The closed itemset lattice can be used as a formal framework for discovering 
frequent itemsets given the basic properties that the support of an itemset I is 
equal to the support of its closure and that the set of maximal frequent itemsets 
is identical to the set of maximal frequent closed itemsets. Then, once A-Close 
has discovered all frequent closed itemsets and their support, we can directly 
determine the frequent itemsets and their support. Hence, we reduce the prob- 
lem of mining association rules to the problem of determining frequent closed 
itemsets and their support. 

Using the set of frequent closed itemsets, we can also directly generate a 
reduced set of association rules without having to determine all frequent item- 
sets, thus lowering the algorithm computation cost. Moreover, since there can 
be thousands of association rules holding in a database, reducing the number 
of rules produced without information loss is an important problem for the understandability of the result [?]. Empirical evaluations comparing A-Close to
an optimized version of Apriori showed that they give nearly always equivalent 
results for weakly correlated data (such as synthetic data) and that A-Close 
clearly outperforms Apriori for correlated data (such as statistical or text data) . 

The rest of the paper is organized as follows. In Section 2 we present the closed itemset lattice. In Section 3 we propose a new model for association rules based on the Galois connection and we characterize a reduced set of association rules. In Section 4 we describe the A-Close algorithm. Section 5 gives experimental results on synthetic data¹ and census data using the PUMS file for Kansas², and Section 6 concludes the paper.

¹ http://www.almaden.ibm.com/cs/quest/syndata.html
² ftp://ftp2.cc.ukans.edu/pub/ippbr/census/pums/pums90ks.zip
2 Closed Itemset Lattices

In this section, we define data mining context, Galois connection, Galois closure 
operators, closed itemsets and closed itemset lattice. Interested readers should 
read [?] for further details on order and lattice theory.

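Definition 1 (Data mining context). A data mining context is a triple $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$. $\mathcal{O}$ and $\mathcal{I}$ are finite sets of objects and items respectively. $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{I}$ is a binary relation between objects and items. Each couple $(o, i) \in \mathcal{R}$ denotes the fact that the object $o \in \mathcal{O}$ is related to the item $i \in \mathcal{I}$.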

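Definition 2 (Galois connection). Let $\mathcal{D} = (\mathcal{O}, \mathcal{I}, \mathcal{R})$ be a data mining context. For $O \subseteq \mathcal{O}$ and $I \subseteq \mathcal{I}$, we define:

$f(O) : 2^{\mathcal{O}} \to 2^{\mathcal{I}}, \qquad f(O) = \{i \in \mathcal{I} \mid \forall o \in O, (o, i) \in \mathcal{R}\}$

$g(I) : 2^{\mathcal{I}} \to 2^{\mathcal{O}}, \qquad g(I) = \{o \in \mathcal{O} \mid \forall i \in I, (o, i) \in \mathcal{R}\}$

f(O) associates with O the items common to all objects $o \in O$ and g(I) associates with I the objects related to all items $i \in I$. The couple of applications (f, g) is a Galois connection between the power set of $\mathcal{O}$ (i.e. $2^{\mathcal{O}}$) and the power set of $\mathcal{I}$ (i.e. $2^{\mathcal{I}}$). The following properties hold for all $I, I_1, I_2 \subseteq \mathcal{I}$ and $O, O_1, O_2 \subseteq \mathcal{O}$:

(1) $I_1 \subseteq I_2 \Rightarrow g(I_1) \supseteq g(I_2)$ (1') $O_1 \subseteq O_2 \Rightarrow f(O_1) \supseteq f(O_2)$

(2) $O \subseteq g(I) \Leftrightarrow I \subseteq f(O)$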
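The two applications f and g can be written directly from their definitions. The sketch below uses a hypothetical five-object context, chosen to agree with the facts the text states later about BCE, B, C and E; it is not a transcription of Figure 1. Property (2) is checked exhaustively on this small context.

```python
from itertools import combinations

# Hypothetical context: object id -> items related to it.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}
ITEMS = frozenset().union(*CONTEXT.values())

def f(objects):
    """Items common to all the given objects."""
    return frozenset(i for i in ITEMS if all(i in CONTEXT[o] for o in objects))

def g(items):
    """Objects related to all the given items."""
    wanted = frozenset(items)
    return frozenset(o for o in CONTEXT if wanted <= CONTEXT[o])

def subsets(universe):
    """All subsets of a finite universe, as frozensets."""
    elems = sorted(universe)
    return [frozenset(c) for k in range(len(elems) + 1) for c in combinations(elems, k)]

# Property (2): O is included in g(I) exactly when I is included in f(O).
property_2 = all(
    (O <= g(I)) == (I <= f(O)) for O in subsets(CONTEXT) for I in subsets(ITEMS)
)
```

Note that `f` applied to the empty set of objects returns every item, which is the convention that makes the pair a Galois connection.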

Definition 3 (Galois closure operators). The operators $h = f \circ g$ in $2^{\mathcal{I}}$ and $h' = g \circ f$ in $2^{\mathcal{O}}$ are Galois closure operators⁴. Given the Galois connection (f, g), the following properties hold for all $I, I_1, I_2 \subseteq \mathcal{I}$ and $O, O_1, O_2 \subseteq \mathcal{O}$ [?]:

Extension: (3) $I \subseteq h(I)$ (3') $O \subseteq h'(O)$

Idempotency: (4) $h(h(I)) = h(I)$ (4') $h'(h'(O)) = h'(O)$

Monotonicity: (5) $I_1 \subseteq I_2 \Rightarrow h(I_1) \subseteq h(I_2)$ (5') $O_1 \subseteq O_2 \Rightarrow h'(O_1) \subseteq h'(O_2)$



Definition 4 (Closed itemsets). An itemset $C \subseteq \mathcal{I}$ from $\mathcal{D}$ is a closed itemset iff $h(C) = C$. The smallest (minimal) closed itemset containing an itemset I is obtained by applying h to I. We call h(I) the closure of I.
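As an illustration, the closure operator and the set of closed itemsets can be computed by brute force on a small context. The context below is hypothetical (it is chosen to be consistent with the statements made about BCE, B, C and E in Section 4.1), and the exhaustive enumeration is only meant for tiny examples.

```python
from itertools import combinations

# Hypothetical context: object id -> items related to it.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}
ITEMS = frozenset().union(*CONTEXT.values())

def g(items):
    """Objects related to all the given items."""
    wanted = frozenset(items)
    return frozenset(o for o in CONTEXT if wanted <= CONTEXT[o])

def f(objects):
    """Items common to all the given objects."""
    return frozenset(i for i in ITEMS if all(i in CONTEXT[o] for o in objects))

def h(items):
    """Galois closure of an itemset: h = f o g."""
    return f(g(items))

# Every closed itemset is the closure of some itemset, so close all of them.
all_itemsets = [
    frozenset(c) for k in range(len(ITEMS) + 1) for c in combinations(sorted(ITEMS), k)
]
closed_itemsets = {h(itemset) for itemset in all_itemsets}
```

On this context 32 itemsets collapse to 8 closed itemsets, which gives a feel for how much smaller the closed itemset lattice can be than the subset lattice.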

Definition 5 (Closed itemset lattice). Let C be the set of closed itemsets derived from $\mathcal{D}$ using the Galois closure operator h. The pair $\mathcal{L}_C = (C, \leq)$ is a complete lattice called closed itemset lattice. The lattice structure implies two properties:

i) There exists a partial order on the lattice elements such that, for all elements $C_1, C_2 \in \mathcal{L}_C$, $C_1 \leq C_2$ iff $C_1 \subseteq C_2$.⁵

³ By extension, we call database a data mining context afterwards.

⁴ Here, we use the following notation: $f \circ g(I) = f(g(I))$ and $g \circ f(O) = g(f(O))$.

⁵ $C_1$ is a sub-closed itemset of $C_2$ and $C_2$ is a sup-closed itemset of $C_1$.
ii) All subsets of $\mathcal{L}_C$ have one least upper bound, the Join element, and one greatest lower bound, the Meet element.

Below, we give the definitions of the Join and Meet elements extracted from the basic theorem on Galois (concept) lattices [?]. For all $S \subseteq \mathcal{L}_C$:

$\mathit{Join}(S) = h\left(\bigcup_{C \in S} C\right), \qquad \mathit{Meet}(S) = \bigcap_{C \in S} C$

[Figure 1: table of the data mining context (columns OID, Items) and diagram of the closed itemset lattice; contents not recoverable]

Fig. 1. The data mining context $\mathcal{D}$ and its associated closed itemset lattice.



3 Association Rule Model 

In this section, we define frequent and maximal frequent itemsets and closed 
itemsets using the Galois connection. We then define association rules and valid 
association rules, and we characterise a reduced set of valid association rules in 
a data mining context T>. 

3.1 Frequent Itemsets 

Definition 6 (Itemset support). Let $I \subseteq \mathcal{I}$ be a set of items from $\mathcal{D}$. The support of the itemset I in $\mathcal{D}$ is:

$\mathit{support}(I) = \frac{\|g(I)\|}{\|\mathcal{O}\|}$

Definition 7 (Frequent itemsets). The itemset I is said to be frequent if the support of I in $\mathcal{D}$ is at least minsup. The set L of frequent itemsets in $\mathcal{D}$ is:

$L = \{I \subseteq \mathcal{I} \mid \mathit{support}(I) \geq \mathit{minsup}\}$

Definition 8 (Maximal frequent itemsets). Let L be the set of frequent itemsets. We define the set M of maximal frequent itemsets in $\mathcal{D}$ as:

$M = \{I \in L \mid \nexists I' \in L, I \subset I'\}$



Property 1. All subsets of a frequent itemset are frequent (intuitive in [2]).

Proof. Let $I, I' \subseteq \mathcal{I}$, $I \in L$ and $I' \subseteq I$. According to Property (1) of the Galois connection: $I' \subseteq I \Rightarrow g(I') \supseteq g(I) \Rightarrow \mathit{support}(I') \geq \mathit{support}(I) \geq \mathit{minsup}$. So, we get: $I' \in L$.

Property 2. All supersets of an infrequent itemset are infrequent (intuitive in [?]).

Proof. Let $I, I' \subseteq \mathcal{I}$, $I' \notin L$ and $I' \subseteq I$. According to Property (1) of the Galois connection: $I \supseteq I' \Rightarrow g(I) \subseteq g(I') \Rightarrow \mathit{support}(I) \leq \mathit{support}(I') < \mathit{minsup}$. So, we get: $I \notin L$.

3.2 Frequent Closed Itemsets 

Definition 9 (Frequent closed itemsets). The closed itemset C is said to be frequent if the support of C in $\mathcal{D}$ is at least minsup. We define the set FC of frequent closed itemsets in $\mathcal{D}$ as:

$FC = \{C \subseteq \mathcal{I} \mid C = h(C) \wedge \mathit{support}(C) \geq \mathit{minsup}\}$

Definition 10 (Maximal frequent closed itemsets). Let FC be the set of frequent closed itemsets. We define the set MC of maximal frequent closed itemsets in $\mathcal{D}$ as:

$MC = \{C \in FC \mid \nexists C' \in FC, C \subset C'\}$



Property 3. The support of an itemset I is equal to the support of its closure: $\mathit{support}(I) = \mathit{support}(h(I))$.

Proof. Let $I \subseteq \mathcal{I}$ be an itemset. The support of I in $\mathcal{D}$ is:

$\mathit{support}(I) = \frac{\|g(I)\|}{\|\mathcal{O}\|}$

Now, we consider h(I), the closure of I. Let us show that $h'(g(I)) = g(I)$. We have $g(I) \subseteq h'(g(I))$ (extension property of the Galois closure) and $I \subseteq h(I) \Rightarrow g(h(I)) \subseteq g(I)$ (Property (1) of the Galois connection). Since $g(h(I)) = g(f(g(I))) = h'(g(I))$, we deduce that $h'(g(I)) = g(I)$, and therefore we have:

$\mathit{support}(h(I)) = \frac{\|g(h(I))\|}{\|\mathcal{O}\|} = \frac{\|h'(g(I))\|}{\|\mathcal{O}\|} = \frac{\|g(I)\|}{\|\mathcal{O}\|} = \mathit{support}(I)$
Property 4. The set of maximal frequent itemsets M is identical to the set of maximal frequent closed itemsets MC.

Proof. It suffices to demonstrate that $\forall I \in M$, I is closed, i.e. $I = h(I)$. Let $I \in M$ be a maximal frequent itemset. According to Property (3) of the Galois connection, $I \subseteq h(I)$ and, since I is maximal and $\mathit{support}(h(I)) = \mathit{support}(I) \geq \mathit{minsup}$, we conclude that $I = h(I)$: I is a maximal frequent closed itemset.

Since all maximal frequent itemsets are also maximal frequent closed itemsets, we get: M = MC.
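Properties 3 and 4 can be checked exhaustively on a small example. The sketch below uses a hypothetical five-object context and a hypothetical threshold minsup = 2/5; it is a sanity check on one instance and not a substitute for the proofs.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical context: object id -> items related to it.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}
ITEMS = frozenset().union(*CONTEXT.values())
MINSUP = Fraction(2, 5)

def g(items):
    wanted = frozenset(items)
    return frozenset(o for o in CONTEXT if wanted <= CONTEXT[o])

def f(objects):
    return frozenset(i for i in ITEMS if all(i in CONTEXT[o] for o in objects))

def h(items):
    return f(g(items))

def support(items):
    return Fraction(len(g(items)), len(CONTEXT))

def maximal(family):
    """Members of the family that have no proper superset in the family."""
    return {I for I in family if not any(I < J for J in family)}

ALL = [frozenset(c) for k in range(len(ITEMS) + 1) for c in combinations(sorted(ITEMS), k)]
L = {I for I in ALL if support(I) >= MINSUP}   # frequent itemsets
FC = {I for I in L if h(I) == I}               # frequent closed itemsets
M, MC = maximal(L), maximal(FC)

property_3 = all(support(I) == support(h(I)) for I in ALL)
```

On this instance there are 16 frequent itemsets but only 6 frequent closed ones (counting the empty itemset), and both families share the single maximal element ABCE.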



3.3 Association Rule Semantics 



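Definition 11 (Association rules). An association rule is an implication between itemsets of the form $I_1 \xrightarrow{c} I_2$ where $I_1, I_2 \subseteq \mathcal{I}$ and $I_1 \cap I_2 = \emptyset$. Below, we define the support and confidence (c) of an association rule $r : I_1 \xrightarrow{c} I_2$ using the Galois connection:

$\mathit{support}(r) = \frac{\|g(I_1 \cup I_2)\|}{\|\mathcal{O}\|}, \qquad \mathit{confidence}(r) = \frac{\mathit{support}(I_1 \cup I_2)}{\mathit{support}(I_1)} = \frac{\|g(I_1 \cup I_2)\|}{\|g(I_1)\|}$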

Definition 12 (Valid association rules). A valid association rule is an association rule with support and confidence greater than or equal to the minsup and minconf thresholds respectively. We define the set $\mathcal{AR}$ of valid association rules in $\mathcal{D}$ using the set MC of maximal frequent closed itemsets as:

$\mathcal{AR}(\mathcal{D}, \mathit{minsup}, \mathit{minconf}) = \{r : I_2 \xrightarrow{c} I_1 - I_2,\ I_2 \subset I_1 \mid I_1 \in L = \bigcup_{C \in MC} 2^{C} \text{ and } \mathit{confidence}(r) \geq \mathit{minconf}\}$



3.4 Reduced Set of Association Rules 

Let $I_1, I_2 \subseteq \mathcal{I}$ and $I_1 \cap I_2 = \emptyset$. An association rule $r : I_1 \xrightarrow{c} I_2$ is an exact association rule if c = 1. Then, r is noted $r : I_1 \Rightarrow I_2$. An association rule $r : I_1 \xrightarrow{c} I_2$ where c < 1 is called an approximate association rule. Let $\mathcal{D}$ be a data mining context.

Definition 13 (Pseudo-closed itemsets). An itemset $I \subseteq \mathcal{I}$ from $\mathcal{D}$ is a pseudo-closed itemset iff $h(I) \neq I$ and $\forall I' \subset I$ such that $I'$ is a pseudo-closed itemset, we have $h(I') \subset I$.



Theorem 1 (Exact association rules basis [?]). Let P be the set of pseudo-closed itemsets and $\mathcal{R}$ the set of exact association rules in $\mathcal{D}$. The set $\mathcal{E} = \{r : I_1 \Rightarrow h(I_1) - I_1 \mid I_1 \in P\}$ is a basis for all exact association rules. $\forall r' \in \mathcal{R}$ where $\mathit{confidence}(r') = 1 \geq \mathit{minconf}$ we have $\mathcal{E} \models r'$.

Corollary 1 (Exact valid association rules basis). Let FP be the set of frequent pseudo-closed itemsets in $\mathcal{D}$. The set $BE = \{r : I_1 \Rightarrow h(I_1) - I_1 \mid I_1 \in FP\}$ is a basis for all exact valid association rules. $\forall r' \in \mathcal{AR}$ where $\mathit{confidence}(r') = 1$ we have $BE \models r'$.

Theorem 2 (Reduced set of approximate association rules [?]). Let C be the set of closed itemsets and $\mathcal{R}$ the set of approximate association rules in $\mathcal{D}$. The set $\mathcal{A} = \{r : I_1 \rightarrow I_2 - I_1 \mid I_1 \subset I_2 \wedge I_1, I_2 \in C\}$ is a correct reduced set for all approximate association rules. $\forall r' \in \mathcal{R}$ where $\mathit{minconf} \leq \mathit{confidence}(r') < 1$ we have $\mathcal{A} \models r'$.

Corollary 2 (Reduced set of approximate valid association rules). Let FC be the set of frequent closed itemsets in $\mathcal{D}$. The set $BA = \{r : I_1 \rightarrow I_2 - I_1 \mid I_1 \subset I_2 \wedge I_1, I_2 \in FC\}$ is a correct reduced set for all approximate valid association rules. $\forall r' \in \mathcal{AR}$ where $\mathit{confidence}(r') < 1$ we have $BA \models r'$.



4 A-Close Algorithm 

In this section, we present our algorithm for finding frequent closed itemsets and their supports in a database. Section 4.1 describes its principle. In Sections 4.2 to 4.5 we give the pseudo-code of the algorithm and the sub-functions it uses. Section 4.6 provides an example and the proof of the algorithm correctness.



4.1 A- Close Principle 

A closed itemset is a maximal set of items common to a set of objects. For example, in the database $\mathcal{D}$ in Figure 1, the itemset BCE is a closed itemset since it is the maximal set of items common to the objects {2, 3, 5}. BCE is called a frequent closed itemset for minsup = 2 as $\mathit{support}(BCE) = \|\{2, 3, 5\}\| = 3 \geq \mathit{minsup}$. In a basket database, this means that 60% of customers (3 customers out of a total of 5) purchase at least the items B, C and E. The itemset BC is not a closed itemset since it is not a maximal group of items common to some objects: all customers purchasing the items B and C also purchase the item E. The closed itemset lattice of a finite relation (the database) is dually isomorphic to the Galois lattice [?], also called concept lattice [?].

Based on the closed itemset lattice properties (Sections 2 and 3), using the result of A-Close we can generate all frequent itemsets from a database $\mathcal{D}$ through the two following phases:

1. Discover all frequent closed itemsets in $\mathcal{D}$, i.e. itemsets that are closed and have support greater than or equal to minsup.

2. Derive all frequent itemsets from the frequent closed itemsets found in phase 1. That is, generate all subsets of the maximal frequent closed itemsets and derive their support from the frequent closed itemset supports.
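Phase 2 can be sketched as follows. The dictionary `FC` below is a hypothetical phase-1 result (support counts for minsup = 2 on a five-object example), and `derive_frequent` is an illustrative helper, not the derivation algorithm of the paper. It relies on Property 3: the support of an itemset is that of its closure, which is the smallest frequent closed itemset containing it and hence the one with the largest support among its closed supersets.

```python
from itertools import combinations

# Hypothetical phase-1 output: frequent closed itemset -> support count.
FC = {
    frozenset("C"): 4,
    frozenset("AC"): 3,
    frozenset("BE"): 4,
    frozenset("BCE"): 3,
    frozenset("ABCE"): 2,
}

def derive_frequent(fc):
    """Return every non-empty frequent itemset with its support count."""
    maximal = [c for c in fc if not any(c < d for d in fc)]
    frequent = {}
    for m in maximal:
        for k in range(1, len(m) + 1):
            for sub in combinations(sorted(m), k):
                itemset = frozenset(sub)
                # Closed supersets form a chain above the closure, so the
                # closure is the one with the largest support.
                frequent[itemset] = max(s for c, s in fc.items() if itemset <= c)
    return frequent

frequent = derive_frequent(FC)
```

No access to the database is needed in this phase: the five closed itemsets are enough to recover the supports of all fifteen non-empty frequent itemsets.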
A different algorithm for finding frequent closed itemsets and algorithms for deriving frequent itemsets and generating valid association rules are presented in [?].

Using the result of A-Close, we can directly generate the reduced set of valid association rules defined in Section 3.4 instead of determining all frequent itemsets. The procedure is the following:

1. Discover all frequent closed itemsets in $\mathcal{D}$.

2. Determine the exact valid association rule basis: determine the pseudo-closed itemsets in $\mathcal{D}$ and then generate all rules $r : I_1 \Rightarrow I_2 - I_1 \mid I_1 \subset I_2$ where $I_2$ is a frequent closed itemset and $I_1$ is a frequent pseudo-closed itemset.

3. Construct the reduced set of approximate valid association rules: generate all rules of the form $r : I_1 \rightarrow I_2 - I_1 \mid I_1 \subset I_2$ where $I_1$ and $I_2$ are frequent closed itemsets.

In both cases, the first phase is the most computationally intensive part. After this phase, no further database pass is necessary and the later phases can be solved in a straightforward manner. Indeed, the first phase has given us all the information needed by the next ones.

A-Close discovers the frequent closed itemsets as follows. Based on the closed itemset properties, it determines a set of generators that will give us all frequent closed itemsets by application of the Galois closure operator h. An itemset p is a generator of a closed itemset c if it is one of the smallest itemsets (there can be more than one) that will determine c using the Galois closure operator: h(p) = c. For instance, in the database $\mathcal{D}$ (Figure 1), BC and CE are generators of the closed itemset BCE. The itemsets B, C and E are not generators of BCE since h(C) = C and h(B) = h(E) = BE. The itemset BCE is not a generator of itself since it includes BC and CE: BCE is not one of the smallest itemsets whose closure is BCE.
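The notion of generator can be illustrated by brute force. The sketch below reads "smallest" as minimal with respect to inclusion, which is an interpretation, and uses a hypothetical context consistent with the closures quoted above.

```python
from itertools import combinations

# Hypothetical context: object id -> items related to it.
CONTEXT = {
    1: frozenset("ACD"),
    2: frozenset("BCE"),
    3: frozenset("ABCE"),
    4: frozenset("BE"),
    5: frozenset("ABCE"),
}
ITEMS = frozenset().union(*CONTEXT.values())

def h(items):
    """Galois closure: items common to all objects containing the itemset."""
    wanted = frozenset(items)
    objects = [o for o in CONTEXT if wanted <= CONTEXT[o]]
    return frozenset(i for i in ITEMS if all(i in CONTEXT[o] for o in objects))

def generators(closed):
    """Inclusion-minimal itemsets whose closure is the given closed itemset."""
    closed = frozenset(closed)
    candidates = [
        frozenset(c)
        for k in range(len(closed) + 1)
        for c in combinations(sorted(closed), k)
        if h(c) == closed
    ]
    return {p for p in candidates if not any(q < p for q in candidates)}

gens_bce = generators("BCE")
gens_abce = generators("ABCE")
```

A closed itemset can have several generators, and all of them share its support by Property 3.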

The algorithm constructs the set of generators in a levelwise manner: (i+1)-generators⁶ are created using i-generators in $G_i$. Then, their support is counted and the useless generators are pruned. According to their supports and the supports of their i-subsets in $G_i$, infrequent generators and generators that have the same closure as one of their subsets are deleted from $G_{i+1}$. In the previous example, the support of the generator BCE is the same as the support of generators BC and CE since they have the same closure (Property 3).

Once all frequent useful generators are found, their closures are determined, giving us the set of all frequent closed itemsets. For reducing the cost of the closure computation when possible, we introduce the following optimization. We determine the first iteration of the algorithm for which an (i+1)-generator was pruned because it had the same closure as one of its i-subsets. In all iterations preceding this one, the generators created are closed and their closure computation is useless. Hence, we can limit the closure computation to generators of size greater than or equal to i. For this purpose, the level variable indicates the first iteration for which a generator was pruned by this pruning strategy.

⁶ A generator of size i is called an i-generator.
4.2 Discovering Frequent Closed Itemsets

As in the Apriori algorithm, items are sorted in lexicographic order. The
pseudo-code for discovering frequent closed itemsets is given in Algorithm 1.
The notation is given in Table 1. In each of the iterations that construct the
candidate generators, one pass over the database is necessary in order to count
the support of the candidate generators. At the end of the algorithm, one more
pass is needed for determining the closures of generators that are not closed.
If all generators are closed, this pass is not made.



Set     Field      Contains
------  ---------  ------------------------------------------------------------
G_i     generator  A generator of size i.
        support    Support count of the generator: support = count(generator).
G, G'   generator  A generator of size i.
        closure    Closure of the generator: closure = h(generator).
        support    Support count of the generator and its closure:
                   support = count(closure) = count(generator) (Property 3).
FC      closure    Frequent closed itemset (closed itemset with support ≥ minsup).
        support    Support count of the frequent closed itemset.

Table 1. Notation



First, the algorithm determines the set G_1 of frequent 1-generators and their
support (steps 1 to 5). Then, the level variable is set to 0 (step 6). In each
of the following iterations (steps 7 to 9), the AC-Generator function
(Section 4.4) is applied to the set of generators G_i, determining the
candidate (i+1)-generators and their support in G_{i+1} (step 8). This process
continues until G_i is empty. Finally, the closures of all generators produced
are determined (steps 10 to 19). Using the level variable, we construct two
sets of generators: the set G, which contains the generators p whose size is
less than level − 1 and which are therefore closed (p = h(p)), and the set G',
which contains the generators whose size is at least level − 1, some of which
are not closed, so that a closure computation is necessary. The closures of
the generators in G' are determined by applying the AC-Closure function
(Section 4.5) to G' (step 18). Then, all frequent closed itemsets have been
produced and their support is known (see Theorem 3).

4.3 Support-Count Function 

The function takes the set G_i of frequent i-generators as argument. It returns
the set G_i with, for each generator p ∈ G_i, its support count:
support(p) = ||{o ∈ O | p ⊆ f({o})}||. The pseudo-code of the function is given
in Algorithm 2.

The Subset function quickly determines which generators are contained in an
object^, i.e. generators that are subsets of f({o}). For this purpose,
generators are stored in a prefix-tree structure derived from the one proposed
in [?].

^ We say that an itemset I is contained in object o if o is related to all items i ∈ I.
Algorithm 1 A-Close algorithm

 1) generators in G_1 ← {1-itemsets};
 2) G_1 ← Support-Count(G_1);
 3) forall generators p ∈ G_1 do begin
 4)   if (support(p) < minsup) then delete p from G_1;   // Pruning infrequent
 5) end
 6) level ← 0;
 7) for (i ← 1; G_i.generator ≠ ∅; i++) do begin
 8)   G_{i+1} ← AC-Generator(G_i);   // Creates (i+1)-generators
 9) end
10) if (level > 2) then begin
11)   G ← ∪ {G_j | j < level−1};   // Those generators are all closed
12)   forall generators p ∈ G do begin
13)     p.closure ← p.generator;
14)   end
15) end
16) if (level ≠ 0) then begin
17)   G' ← ∪ {G_j | j ≥ level−1};   // Some of those generators are not closed
18)   G' ← AC-Closure(G');
19) end
20) Answer FC ← {c.closure, c.support | c ∈ G ∪ G'};



Algorithm 2 Support-Count function

1) forall objects o ∈ O do begin
2)   G_o ← Subset(G_i.generator, f({o}));   // Generators that are subsets of f({o})
3)   forall generators p ∈ G_o do begin
4)     p.support++;
5)   end
6) end
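
The single database pass of Support-Count can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: a plain subset test stands in for the prefix-tree Subset function, and the five-object context and the name support_count are hypothetical.

```python
# Hypothetical context: one set of items per object.
CONTEXT = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]


def support_count(generators, context):
    """Return {generator: support count} using one pass over the objects."""
    counts = {frozenset(p): 0 for p in generators}
    for items in context:            # forall objects o
        for p in counts:             # naive stand-in for Subset(G_i, f({o}))
            if p <= items:           # the generator is contained in the object
                counts[p] += 1
    return counts


counts = support_count(["A", "B", "C", "D", "E"], CONTEXT)
```

The inner loop tests every generator against every object, which is the cost that the prefix tree described in the text is meant to reduce.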



4.4 AC-Generator Function 

The function takes the set G_i of frequent i-generators as argument. Based on
Lemmas 1 and 2, it returns the set G_{i+1} of frequent (i+1)-generators. The
pseudo-code of the function is given in Algorithm 3.

Lemma 1. Let $I_1, I_2$ be two itemsets. We have:

$$h(I_1 \cup I_2) = h(h(I_1) \cup h(I_2))$$

Proof. Let $I_1$ and $I_2$ be two itemsets. According to the extension property
of the Galois closure operators:

$$I_1 \subseteq h(I_1) \text{ and } I_2 \subseteq h(I_2) \implies I_1 \cup I_2 \subseteq h(I_1) \cup h(I_2) \implies h(I_1 \cup I_2) \subseteq h(h(I_1) \cup h(I_2)) \quad (1)$$

Obviously, $I_1 \subseteq I_1 \cup I_2$ and $I_2 \subseteq I_1 \cup I_2$. So
$h(I_1) \subseteq h(I_1 \cup I_2)$ and $h(I_2) \subseteq h(I_1 \cup I_2)$.
According to the idempotency property of the Galois closure operators:

$$h(h(I_1) \cup h(I_2)) \subseteq h(h(I_1 \cup I_2)) \implies h(h(I_1) \cup h(I_2)) \subseteq h(I_1 \cup I_2) \quad (2)$$

From (1) and (2), we conclude that $h(I_1 \cup I_2) = h(h(I_1) \cup h(I_2))$.



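Lemma 2. Let $I_1$ be an itemset and $I_2$ a subset of $I_1$ where
$\mathrm{support}(I_1) = \mathrm{support}(I_2)$. Then we have $h(I_1) = h(I_2)$
and $\forall I_3 \subseteq \mathcal{I}$, $h(I_1 \cup I_3) = h(I_2 \cup I_3)$.

Proof. Let $I_1, I_2$ be two itemsets where $I_2 \subseteq I_1$ and
$\mathrm{support}(I_1) = \mathrm{support}(I_2)$. Then we have
$\|g(I_1)\| = \|g(I_2)\|$ and, since $I_2 \subseteq I_1$ implies
$g(I_1) \subseteq g(I_2)$, we deduce that $g(I_1) = g(I_2)$. From this, we
conclude $f(g(I_1)) = f(g(I_2)) \implies h(I_1) = h(I_2)$. Let
$I_3 \subseteq \mathcal{I}$ be an itemset. Then according to Lemma 1:

$$h(I_1 \cup I_3) = h(h(I_1) \cup h(I_3)) = h(h(I_2) \cup h(I_3)) = h(I_2 \cup I_3)$$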
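
Both lemmas can be checked by brute force on a small example. The sketch below is only a sanity check on one hypothetical context (the same five-object context assumed elsewhere in these examples), not a proof; all names are illustrative.

```python
from itertools import combinations

# Hypothetical context: one set of items per object.
CONTEXT = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]
ITEMS = sorted(set().union(*CONTEXT))


def g(itemset):
    """Indices of the objects that contain the itemset."""
    return frozenset(i for i, obj in enumerate(CONTEXT) if itemset <= obj)


def h(itemset):
    """Galois closure: items shared by all objects containing the itemset."""
    objs = [CONTEXT[i] for i in g(itemset)]
    return frozenset(set.intersection(*objs)) if objs else frozenset(ITEMS)


def all_itemsets():
    return [frozenset(c) for r in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, r)]


def check_lemma_1():
    sets = all_itemsets()
    return all(h(a | b) == h(h(a) | h(b)) for a in sets for b in sets)


def check_lemma_2():
    sets = all_itemsets()
    for i1 in sets:
        for i2 in sets:
            # Premise: I2 is a subset of I1 with the same support.
            if i2 <= i1 and len(g(i1)) == len(g(i2)):
                if h(i1) != h(i2):
                    return False
                if any(h(i1 | i3) != h(i2 | i3) for i3 in sets):
                    return False
    return True
```

Both checks enumerate all 32 itemsets over the five items, so they are only practical for toy contexts.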



Corollary 3. Let $I$ be an i-generator and $S = \{s_1, s_2, \ldots, s_j\}$ a set
of (i−1)-subsets of $I$ where $\bigcup_{s \in S} s = I$. If $\exists s \in S$
such that $\mathrm{support}(s) = \mathrm{support}(I)$, then $h(I) = h(s)$.

Proof. Derived from Lemma 2.

The AC-Generator function works as follows. We first apply the combinatorial
phase of Apriori-Gen [2] to the set of generators G_i in order to obtain a set
of candidate (i+1)-generators: two generators of size i in G_i with the same
first i − 1 items are joined, producing a new potential generator of size i + 1
(steps 1 to 4). Then, the potential generators produced that will lead to
useless computations (infrequent closed itemsets) or redundancies (frequent
closed itemsets already produced) are pruned from G_{i+1} as follows.

First, as in Apriori-Gen, G_{i+1} is pruned by removing every candidate
(i+1)-generator c such that some i-subset of c is not in G_i (steps 5 to 9).
Using this strategy, we prune two kinds of itemsets: first, all supersets of
infrequent generators (which are also infrequent according to Property 2);
second, all generators that have the same support as one of their subsets and
therefore have the same closure (see Theorem 3). Let’s take an example. Suppose
that the set of frequent generators G_2 contains the generators AB and AC. The
AC-Generator function will create ABC = AB ∪ AC as a new potential generator in
G_3 and the first pruning will remove ABC since BC ∉ G_2.

Next, the supports of the remaining candidate generators in G_{i+1} are
determined and, based on Property 2, those with support less than minsup are
deleted from G_{i+1} (steps 10 to 12).

The third pruning strategy works as follows. For each candidate generator
c in G_{i+1}, we test if the support of one of its i-subsets s is equal to the
support of c. In that case, the closure of c will be equal to the closure of s
(see Corollary 3), so we remove c from G_{i+1} (steps 14 to 20). Let’s give
another example. Suppose that the final set of generators G_2 contains the
frequent generators AB, AC, BC and their respective supports 3, 2, 3. The
AC-Generator function will create ABC = AB ∪ AC as a new potential generator
in G_3
Algorithm 3 AC-Generator function

 1) insert into G_{i+1}
 2) select p.item_1, p.item_2, ..., p.item_i, q.item_i
 3) from G_i p, G_i q
 4) where p.item_1 = q.item_1, ..., p.item_{i-1} = q.item_{i-1}, p.item_i < q.item_i;
 5) forall candidate generators c ∈ G_{i+1} do begin
 6)   forall i-subsets s of c do begin
 7)     if (s ∉ G_i) then delete c from G_{i+1};
 8)   end
 9) end
10) G_{i+1} ← Support-Count(G_{i+1});
11) forall candidate generators c ∈ G_{i+1} do begin
12)   if (support(c) < minsup) then delete c from G_{i+1};   // Pruning infrequent
13)   else do begin
14)     forall i-subsets s of c do begin
15)       if (support(s) = support(c)) then begin
16)         delete c from G_{i+1};
17)         if (level = 0) then level ← i;   // Iteration number of the first prune
18)       endif
19)     end
20)   end
21) end
22) Answer ← ∪ {c ∈ G_{i+1}};



and suppose it determines that its support is 2. The third prune step will
remove ABC from G_3 since support(ABC) = support(AC). Indeed, we deduce that
closure(ABC) = closure(AC) and the computation of the closure of ABC is
useless. For the optimization of the generator closure computation in
Algorithm 1, we determine the iteration at which the third prune first
suppressed a generator (variable level).
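
The join and the three pruning strategies can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: generators are sorted tuples mapped to their supports, the support of a candidate comes from a caller-supplied function (an assumption standing in for the Support-Count pass), and a boolean flag replaces the update of the level variable.

```python
from itertools import combinations


def ac_generator(G_i, support_of, minsup):
    """Return (G_next, pruned_by_support_equality).

    G_i maps each i-generator (a sorted tuple of items) to its support.
    support_of(candidate) is a hypothetical callback returning a support count.
    """
    gens = sorted(G_i)
    # Join phase: two i-generators that share their first i-1 items.
    candidates = [p + (q[-1],) for p in gens for q in gens
                  if p[:-1] == q[:-1] and p[-1] < q[-1]]
    G_next, pruned_same_support = {}, False
    for c in candidates:
        subsets = list(combinations(c, len(c) - 1))
        # First pruning: every i-subset of c must be in G_i.
        if any(s not in G_i for s in subsets):
            continue
        sup = support_of(c)
        # Second pruning: infrequent candidates.
        if sup < minsup:
            continue
        # Third pruning: same support (hence same closure) as an i-subset.
        if any(G_i[s] == sup for s in subsets):
            pruned_same_support = True
            continue
        G_next[c] = sup
    return G_next, pruned_same_support


# Second example of the text: AB, AC, BC with supports 3, 2, 3; support(ABC) = 2.
G2 = {("A", "B"): 3, ("A", "C"): 2, ("B", "C"): 3}
G3, flag = ac_generator(G2, lambda c: 2, minsup=2)
```

With the supports quoted in the text, ABC is removed by the third pruning, so G3 is empty and the flag is set. In the first example (G_2 containing only AB and AC), ABC is dropped by the first pruning before its support is ever requested.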

4.5 AC-Closure Function 

The AC-Closure function takes the set of frequent generators G, for which
closures must be determined, as argument. It updates G with, for each generator
p ∈ G, the closed itemset p.closure obtained by applying the closure operator
h to p. Algorithm 4 gives the pseudo-code of the function. The method used to
compute closures is based on Proposition 1.

Proposition 1. The closed itemset h(I) corresponding to the closure by h of
the itemset I is the intersection of all objects in the database that contain I:

$$h(I) = \bigcap_{o \in \mathcal{O}} \{ f(\{o\}) \mid I \subseteq f(\{o\}) \}$$

Proof. We define $H = \bigcap_{o \in S} f(\{o\})$ where
$S = \{o \in \mathcal{O} \mid I \subseteq f(\{o\})\}$. We have
$h(I) = f(g(I)) = \bigcap_{o \in g(I)} f(\{o\}) = \bigcap_{o \in S'} f(\{o\})$
where $S' = \{o \in \mathcal{O} \mid o \in g(I)\}$.






Let’s show that S' = S:

$$I \subseteq f(\{o\}) \implies o \in g(I)$$
$$o \in g(I) \implies I \subseteq f(g(I)) \subseteq f(\{o\})$$

We conclude that S = S', thus h(I) = H.



Algorithm 4 AC-Closure function

1) forall objects o ∈ O do begin
2)   G_o ← Subset(G.generator, f({o}));   // Generators that are subsets of f({o})
3)   forall generators p ∈ G_o do begin
4)     if (p.closure = ∅) then p.closure ← f({o});
5)     else p.closure ← p.closure ∩ f({o});
6)   end
7) end
8) Answer ← ∪ {p ∈ G | ∄ p' ∈ G, closure(p') = closure(p)};



Using Proposition 1, only one database pass is necessary to compute the closures
of the generators. The function works as follows. For each object o in D, the
set G_o is created (step 2). G_o contains all generators in G that are subsets
of the object itemset f({o}). Then, for each generator p in G_o, the associated
closed itemset p.closure is updated (steps 3 to 6). If the object o is the first
one containing the generator, p.closure is empty and the object itemset f({o})
is assigned to it (step 4). Otherwise, the intersection between p.closure and
the object itemset gives the new p.closure (step 5). At the end, the function
returns for each generator p in G, the closed itemset p.closure corresponding
to the intersection of all objects containing p.
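
The one-pass closure computation can be sketched in Python as follows. The sketch rests on assumptions: the five-object context is hypothetical, a plain subset test replaces the prefix-tree Subset function, and None (rather than the empty set) marks a closure that has not been initialized yet.

```python
# Hypothetical context: one set of items per object.
CONTEXT = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]


def ac_closure(generators, context):
    """Return {generator: closure}, computed in a single pass over the objects."""
    closure = {frozenset(p): None for p in generators}
    for items in context:                       # forall objects o
        for p in closure:                       # naive stand-in for Subset(...)
            if p <= items:
                if closure[p] is None:          # first object containing p
                    closure[p] = set(items)
                else:                           # intersect with the object itemset
                    closure[p] &= items
    return {p: frozenset(c) for p, c in closure.items() if c is not None}


closures = ac_closure(["A", "B", "C", "E", "AB", "AE", "BC", "CE"], CONTEXT)
distinct = set(closures.values())               # duplicate closures collapse here
```

Collecting the closures into a set plays the role of the duplicate removal in the last step of the function.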



4.6 Example and Correctness 

Figure 2 gives the execution of A-Close for a minimum support of 2 (40%) on the
data mining context D given in Figure 1. First, the algorithm determines the set
G_1 of 1-generators and their support (steps 1 and 2), and the infrequent
generator D is deleted from G_1 (steps 3 to 5). Then, generators in G_2 are
determined by applying the AC-Generator function to G_1 (step 8): the
2-generators are created by union of generators in G_1, their support is
determined and the three pruning strategies are applied. Generators AC and BE
are pruned since support(AC) = support(A) and support(BE) = support(B), and the
level variable is set to 2.

Calling AC-Generator with G_2 produces 3-generators in G_3. The only generator
created in G_3 is ABE since only AB and AE have the same first item. The three
pruning strategies are applied and the first one removes ABE from G_3 as
BE ∉ G_2. Then, G_3 is empty and the iterative construction of sets G_i
terminates (the loop in steps 7 to 9 stops).
Fig. 2. A-Close frequent closed itemset discovery for minsup = 2 (40%)



The sets G and G' are constructed using the level variable (steps 10 to 17): G
is empty and G' contains the generators from G_1 and G_2. The closure function
AC-Closure is applied to G' and the closures of all generators in G' are
determined (step 18). Finally, duplicate closures are removed from G' by
AC-Closure and the result is returned to the set FC, which therefore contains
AC, BE, C, ABCE and BCE, which are all the frequent closed itemsets in D.
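
The whole run can be reproduced with a compact Python sketch. It is a simplification under stated assumptions rather than the authors' implementation: the context is a hypothetical one consistent with the supports quoted above, supports are recounted naively instead of using a prefix tree, and the level-based shortcut is omitted, so the closure of every generator is computed.

```python
from itertools import combinations

# Hypothetical context consistent with the supports quoted in the example.
CONTEXT = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "B", "C", "E"},
]


def support(itemset, context):
    return sum(1 for obj in context if set(itemset) <= obj)


def closure(itemset, context):
    """Intersection of all objects containing the itemset (Proposition 1)."""
    objs = [obj for obj in context if set(itemset) <= obj]
    return frozenset(set.intersection(*objs))


def next_level(G_i, context, minsup):
    """Candidate join followed by the three pruning strategies."""
    gens, G_next = sorted(G_i), {}
    for p in gens:
        for q in gens:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                subsets = list(combinations(c, len(c) - 1))
                if any(s not in G_i for s in subsets):
                    continue                      # an i-subset is missing
                sup = support(c, context)
                if sup < minsup:
                    continue                      # infrequent
                if any(G_i[s] == sup for s in subsets):
                    continue                      # same closure as a subset
                G_next[c] = sup
    return G_next


def a_close(context, minsup):
    """Return {frequent closed itemset: support}; assumes minsup >= 1."""
    items = sorted(set().union(*context))
    G = {(x,): support((x,), context) for x in items}
    G = {p: s for p, s in G.items() if s >= minsup}
    generators = {}
    while G:
        generators.update(G)
        G = next_level(G, context, minsup)
    # Simplification: close every generator instead of using `level`.
    return {closure(p, context): sup for p, sup in generators.items()}


FC = a_close(CONTEXT, minsup=2)
```

On the assumed context the sketch yields the five closed itemsets AC, BE, C, ABCE and BCE named in the text.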

Lemma 3. For $p \subseteq \mathcal{I}$ such that $\|p\| > 1$, if
$p \notin G_{\|p\|}$ and $\mathrm{support}(p) \geq minsup$ then
$\exists s_1, s_2 \subseteq \mathcal{I}$, $s_1 \subset s_2 \subseteq p$ and
$\|s_1\| = \|s_2\| - 1$ such that $h(s_1) = h(s_2)$ and
$s_1 \in G_{\|s_1\|}$.

Proof. We show this using a recurrence. For $\|p\| = 2$, we have $p = s_2$ and
$\exists s_1 \in G_1 \mid s_1 \subset s_2$ and
$\mathrm{support}(s_1) = \mathrm{support}(s_2) \implies h(s_1) = h(s_2)$
(Lemma 3 is obvious). Then, supposing that Lemma 3 is true for $\|p\| = i$,
let’s show that it is true for $\|p\| = i + 1$. Let
$p \subseteq \mathcal{I} \mid \|p\| = i + 1$ and $p \notin G_{\|p\|}$. There are
two possible cases:

(1) $\exists p' \subset p \mid \|p'\| = i$ and $p' \notin G_{\|p'\|}$

(2) $\exists p' \subset p \mid \|p'\| = i$ and $p' \in G_{\|p'\|}$ and
$\mathrm{support}(p) = \mathrm{support}(p') \implies h(p) = h(p')$ (Lemma 2)

If (1) then according to the recurrence hypothesis,
$\exists s_1 \subset s_2 \subseteq p' \subset p$ such that $h(s_1) = h(s_2)$ and
$s_1 \in G_{\|s_1\|}$. If (2) then we identify $s_1$ with $p'$ and $s_2$ with $p$.



Theorem 3. The A-Close algorithm generates all frequent closed itemsets.

Proof. Using a recurrence, we show that
$\forall p \subseteq \mathcal{I} \mid \mathrm{support}(p) \geq minsup$ we have
$h(p) \in FC$. We first demonstrate the property for the 1-itemsets:
$\forall p \subseteq \mathcal{I}$ where $\|p\| = 1$, if
$\mathrm{support}(p) \geq minsup$ then $p \in G_1 \implies h(p) \in FC$. Let’s
suppose that $\forall p \subseteq \mathcal{I}$ such that $\|p\| = i$ we have
$h(p) \in FC$. We then demonstrate that $\forall p \subseteq \mathcal{I}$ where
$\|p\| = i + 1$ we have $h(p) \in FC$. If $p \in G_{\|p\|}$ then
$h(p) \in FC$. Else, if $p \notin G_{\|p\|}$, according to Lemma 3, we have:
$\exists s_1 \subset s_2 \subseteq p \mid s_1 \in G_{\|s_1\|}$ and
$h(s_1) = h(s_2)$. Now $h(p) = h(s_2 \cup (p - s_2)) = h(s_1 \cup (p - s_2))$ and
$\|s_1 \cup (p - s_2)\| = i$, therefore in conformity with the recurrence
hypothesis we conclude that $h(s_1 \cup (p - s_2)) \in FC$ and so $h(p) \in FC$.

5 Experimental Results 

We implemented the Apriori and A-Close algorithms in C++, both using the
same prefix-tree structure that improves Apriori efficiency. Experiments were
run on a 43P240 bi-processor IBM Power-PC running AIX 4.1.5 with a CPU
clock rate of 166 MHz, 1 GB of main memory and a 9 GB disk. Each execution
used only one processor (the application was single-threaded) and was allowed
a maximum of 128 MB of memory.



Test Data We used two kinds of datasets: synthetic data, which simulate market
basket data, and census data, which are typical statistical data. The synthetic
datasets were generated using the program described in [2]. The census data
were extracted from the Kansas 1990 PUMS file (Public Use Microdata Samples),
in the same way as [3] for the PUMS file of Washington (unavailable through
the Internet at the time of the experiments). Unlike in [3] though, we did not
put an upper bound on the support, as this distorts each algorithm’s results in
different ways. We therefore took smaller datasets containing the first 10,000
persons.



Parameter                                                   T10I4D100K  T20I6D100K  C20D10K  C73D10K
----------------------------------------------------------  ----------  ----------  -------  -------
Average size of the objects                                 10          20          20       73
Total number of items                                       1000        1000        386      2178
Number of objects                                           100K        100K        10K      10K
Average size of the maximal potentially frequent itemsets   4           6           -        -

Table 2. Notation
Results on Synthetic Data Figure 3 shows the execution times of Apriori
and A-Close on the datasets T10I4D100K and T20I6D100K. We can observe
that both algorithms always give similar results except for executions with
minsup = 0.5% and 0.33% on T20I6D100K. This similarity comes from the fact
that data are weakly correlated and sparse in such datasets. Hence, the sets of
generators in A-Close and frequent itemsets in Apriori are identical, and the
closure mechanism does not help in jumping iterations. In the two cases where
Apriori outperforms A-Close, a generator was pruned in the 4th iteration
because it had the same support as one of its subsets. As a consequence,
A-Close determined the closures of all generators of size greater than or equal
to 3.





Fig. 3. Performance of Apriori and A-Close on synthetic data (panels: execution times on T10I4D100K; execution times on T20I6D100K)



Results on Census Data Experiments were conducted on the two census
datasets using different minsup ranges to get meaningful response times and
to accommodate the memory space limit. Results for the C20D10K and
C73D10K datasets are plotted in Figures 4 and 5, respectively. A-Close always
significantly outperforms Apriori, for execution times as well as number of
database passes. Here, contrary to the experiments on synthetic data, the
differences between execution times can be measured in minutes for C20D10K and
in hours for C73D10K. It should furthermore be noted that Apriori could not be
run for minsup lower than 3% on C20D10K and lower than 70% on C73D10K as it
exceeds the memory limit. Census datasets are typical of statistical databases:
highly correlated and dense data. Since many items are extremely popular, this
leads to a huge number of frequent itemsets, of which few are closed.

Scale Up Properties on Census Data We finally examined how Apriori and
A-Close behave as the object size is increased in census data. The number of
objects was fixed to 10,000 and the minsup level was set to 10%. The object size
varied from 10 (281 total items) up to 24 (408 total items). Apriori could not be
run for higher object sizes. Results are shown in Figure 6. We can see here that
the scale-up properties of A-Close are far better than those of Apriori.







Fig. 4. Performance of Apriori and A-Close on census data C20D10K (panels: execution times; number of database passes)

Fig. 5. Performance of Apriori and A-Close on census data C73D10K (panels: execution times; number of database passes)




Fig. 6. Scale-up properties of Apriori and A-Close on census data 
6 Conclusion

We presented a new algorithm, called A-Close, for discovering frequent closed
itemsets in large databases. This algorithm is based on the pruning of the closed
itemset lattice instead of the itemset lattice, which is the commonly used
approach. Since this lattice is a sub-order of the itemset lattice, the number
of itemsets considered is significantly reduced for many datasets. Given the set
of frequent closed itemsets and their support, we showed that we can either
deduce all frequent itemsets, or construct a reduced set of valid association
rules without having to search for frequent itemsets.
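
The first of these two uses can be illustrated with a short Python sketch. It relies on the fact that an itemset has the same support as its closure, which is the smallest closed itemset containing it and therefore the closed superset with the largest support. The dictionary FC below is an assumption: it holds the five closed itemsets of the Section 4.6 example with supports taken from a hypothetical context consistent with that example, and the function name is illustrative.

```python
# Assumed frequent closed itemsets and supports for the Section 4.6 example.
FC = {
    frozenset("AC"): 3,
    frozenset("BE"): 4,
    frozenset("C"): 4,
    frozenset("ABCE"): 2,
    frozenset("BCE"): 3,
}


def support_from_closed(itemset, fc):
    """Support of an itemset deduced from the frequent closed itemsets.

    Returns None when no frequent closed itemset contains the itemset,
    which means the itemset is not frequent.
    """
    x = frozenset(itemset)
    supports = [sup for closed, sup in fc.items() if x <= closed]
    return max(supports) if supports else None
```

For instance, AB is contained only in ABCE, so its deduced support is 2, while D is contained in no frequent closed itemset and is reported as infrequent.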

We conducted experiments in order to compare our approach to the itemset
lattice exploration approach. We implemented A-Close and an optimized version
of Apriori using prefix-trees. The choice of Apriori stems from the fact that,
in practice, it remains one of the most general and powerful algorithms. Those
experiments showed that A-Close is very efficient for mining dense and/or
correlated data (such as statistical data): on such datasets, the number of
itemsets considered and the number of database passes made are significantly
reduced compared to those Apriori needs. They also showed that the two
algorithms perform equivalently on weakly correlated data (such as synthetic
data) in which many generators are closed. This results from the adaptive
characteristic of A-Close, which consists in determining the first iteration
for which it is necessary to compute closures of generators. In this way,
A-Close avoids many useless closure computations.

We think these results are very interesting since dense and/or correlated data 
represent an important part of all existing data, and since mining such data is 
considered as very difficult. Statistical, text, biological and medical data are 
examples of such correlated data. Supermarket data are weakly correlated and 
quite sparse, but experimental results showed that mining such data is consider- 
ably less difficult than mining correlated data. In the first case, executions take 
some minutes at most whereas in the second case, executions sometimes take 
several hours. 

Moreover, A-Close gives an efficient unsupervised classification technique: the
closed itemset lattice of an order is dually isomorphic to the Dedekind-MacNeille 
completion of an order 0, which is the smallest lattice associated with an order. 
The closest work is Ganter’s algorithms, which work only in main memory.
This feature is very interesting since unsupervised classification is another im- 
portant problem in data mining jS| and in machine learning. 
Abstract. We explore a new form of view rewrite called view disassembly.
The objective is to rewrite views in order to “remove” certain
sub-views (or unfoldings) of the view. This becomes pertinent for complex
views which may be defined over other views and which may involve
union. Such complex views arise necessarily in environments such as data
warehousing and mediation over heterogeneous databases. View disassembly
can be used for view and query optimization, preserving data
security, making use of cached queries and materialized views, and view
maintenance.

We provide computational complexity results for view disassembly. We
show that finding optimal rewrites for disassembled views is at least
NP-hard. However, we provide good news too. We provide an approximation
algorithm that has much better run-time behavior. We show a pertinent
class of unfoldings whose removal always results in a simpler
disassembled view than the view itself. We also show the complexity of
determining when a collection of unfoldings covers the view definition.



1 Introduction 

Many database applications and environments, such as mediation over hetero- 
geneous database sources and data warehousing for decision support, lead to 
complex view definitions. Views are often nested, defined over previously defined 
views, and may involve unions. The union operator is a necessity in mediation, 
as views in the meta-schema are defined to combine data from disparate sources. 
In these environments, view definition maintenance is of paramount importance. 

There are many reasons why one might want to “remove” components, or
sub-views, from a given view, or from a query which involves views.^1 Let us
call a sub-view an unfolding of the view, as the view can be unfolded via its
definition into more specific sub-views. These reasons include the following.

1. Some unfoldings of the view may be effectively cached from previous queries 
0, or may be materialized views 1 1 2] . 

^1 In this paper, we use view and query synonymously.
2. Some unfoldings may be known to evaluate empty, by reasoning over the
integrity constraints.

3. Some unfoldings may match protected queries, which, for security, cannot 
be evaluated for all users P!- 

4. Some unfoldings may be subsumed by previously asked queries, so are not 
of interest to the user. 

What does it mean to remove an unfolding from a view or query? The modified
view or query should not subsume — and thus, when evaluated, should never 
evaluate — the removed unfoldings, but should subsume “everything else” of the 
original view. 

In case 1, one might want to separate out certain unfoldings, because they
can be evaluated much less expensively (and, in a networked, distributed
environment, be evaluated locally). Then, the “remainder query” could be
evaluated separately [?]. In case 2, the unfoldings are free to evaluate, since
it is known in advance that they must evaluate empty. If the remainder query is
less expensive to evaluate than the original, this is an optimization. In
case 3, when some unfoldings are protected, this does not mean that the “rest”
of the query or view cannot be safely evaluated. In case 4, when a user is
asking a series of queries, he or she may be interested only in the stream of
answers being returned, so any previously seen answers are no longer of
interest. In environments in which there are monetary charges for information,
there is an additional advantage of not having to pay repeatedly for the same
information.

In this paper, we address this problem of how to remove efficiently and cor- 
rectly sub-views from views. We call this problem view disassembly. We present 
the computational complexities of, and potential algorithmic approaches to, view 
disassembly tasks. On first consideration, it may seem that the view disassembly 
problem is trivial, that a view could always be “trimmed” to exclude any given 
sub-view. On further consideration, however, one quickly sees that this is not 
true. To remove a sub-view, or especially a collection of sub-views, can be
quite a complex task. Certain unfolding removals do, in fact, abide by the
first intuition:
the view’s definition can be trimmed to exclude them. We call these unfoldings 
simple, and we characterize these in this paper. In the general case, however, 
unfoldings require that the view’s definition be rewritten to effectively remove 
them. We are interested in compact rewrites that accomplish this. 

We represent queries and views in Datalog. Thus, we consider databases under
the logic model [?]. For this paper, we consider neither recursion nor negation.
A database DB is considered to consist of two parts: the extensional database
(EDB), a set of atomic facts; and the intensional database (IDB), a set of
clausal rules. Predicates are designated as either extensional or intensional.
Extensional predicates are defined solely via the facts in the EDB. Intensional
predicates are defined solely via the rules in the IDB. Thus, extensional
predicates are equivalent to base relations defined by relational tables in the
relational database realm, and intensional predicates to relational views. We
shall employ the term view to refer to any intensional predicate or any query
that uses intensional predicates.
In [3], we called the notion of a view or query with some of its unfoldings
(sub-views) “removed” a discounted query or view, and called the “removed”
unfoldings the unfoldings-to-discount. A view (or query) can be represented as
an AND/OR tree that represents the expansion of its definition via the rules
in the IDB. In view disassembly, we consider algebraic rewrites of the view’s
corresponding AND/OR tree to find AND/OR trees that properly represent the
discounted view. Consider the following example.

Example 1. Let there be six relations defined in the database DB: 

- Departments (did, address)
- Institutes (did, address)
- Faculty (eid, did, rank)
- Staff (eid, did, position)
- Health_ins (eid, premium, provider)
- Life_ins (eid, premium, provider)

Let there also be three views defined in terms of these relations:

academic_units(X, Y) ← departments(X, Y).
academic_units(X, Y) ← institutes(X, Y).
employees(X, Y) ← faculty(X, Y, Z).
employees(X, Y) ← staff(X, Y, Z).
benefits(X, Y, Z) ← health_ins(X, Y, Z).
benefits(X, Y, Z) ← life_ins(X, Y, Z).

Define the following query Q that asks for addresses of all academic units
with any employees receiving benefits from Ætna.^2

Q: q(Y) ← a(X, Y), e(Z, X), b(Z, W, ætna).




Fig. 1. The AND / OR tree representation of the original query. 



Query Q can be represented as the parse tree of its relational algebra
representation, which is an AND/OR tree, as shown in Figure 1. Evaluating the
query in the order of operations indicated by its relational algebra
representation is equivalent to materializing the nodes of the query’s AND/OR
tree. We refer to this type of evaluation (and representation) as bottom-up.
Consider the following three scenarios:

^2 We henceforth abbreviate the predicate names.

Case 1. Assume that the answers of the following three queries F_1, F_2, and
F_3 have been cached. Equivalently, we could assume that these represent
materialized views, or that they are known to evaluate empty (by reasoning over
integrity constraints). Let f_1, f_2, and f_3 be the corresponding cache
predicates.

f_1(Y) ← a(X, Y), f(Z, X, V), b(Z, W, ætna).
f_2(Y) ← d(X, Y), e(Z, X), b(Z, W, ætna).
f_3(Y) ← i(X, Y), s(Z, X, V), b(Z, W, ætna).

Q does not have to be evaluated, since its answer set is equal to the union
of the answer sets of the cached queries.^3 We say that queries F_1, F_2, and
F_3 cover the query Q.
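
The coverage claim of Case 1 can be checked by brute force. The Python sketch below expands every predicate down to base relations and compares the resulting sets of join expressions; this exhaustive expansion is only an illustration on a tiny example, not the decision procedure studied in the paper. The names VIEWS, expand and covers are hypothetical, and h and l are assumed abbreviations for health_ins and life_ins.

```python
from itertools import product

# Each view predicate with the base relations it unions (abbreviated names).
VIEWS = {"a": ("d", "i"), "e": ("f", "s"), "b": ("h", "l")}


def expand(body):
    """All base-level join expressions of a conjunctive body of predicates."""
    choices = [VIEWS.get(pred, (pred,)) for pred in body]
    return set(product(*choices))


def covers(unfoldings, query):
    """True if the unfoldings together account for every expansion of query."""
    covered = set().union(*(expand(u) for u in unfoldings))
    return expand(query) <= covered


Q = ("a", "e", "b")
F1, F2, F3 = ("a", "f", "b"), ("d", "e", "b"), ("i", "s", "b")
case_1_covered = covers([F1, F2, F3], Q)
```

Here F_1 accounts for every expansion through faculty, F_2 for those through departments, and F_3 for the remaining institutes and staff combinations.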

Case 2. Assume that the following query F_4 has been likewise cached:

f_4(Y) ← d(X, Y), e(Z, X), b(Z, W, ætna).

As F_4 provides a subset of the answer set to Q, we can rewrite Q as Q' to
retrieve only the remaining answers:

q'(Y) ← i(X, Y), e(Z, X), b(Z, W, ætna).

Note that unless special tools are available to evaluate efficiently the join
Academic_Units ⋈ Employees, the rewrite of Q to Q' provides an optimization.
In this case, F_4 is a simple unfolding of Q (defined in Section ??).

Case 3. Assume that the following query F_5 has been likewise cached.

f_5(Y) ← d(X, Y), f(Z, X, V), l(Z, W, ætna).

Again, we may want to “remove” F_5 from the rest of the query, as its answers
are locally available. One way to do this is to rewrite Q as a union of join
expressions over base tables, to remove the join expression represented by
F_5, and then to evaluate the remaining join expressions. This can be quite
inefficient, however. The number of join expressions that remain to be
evaluated may be exponential in the size of the collection of view definitions
and the query. Furthermore, we showed in [?] that such an evaluation plan
(which we call a top-down evaluation plan) may require evaluating the same
joins multiple times, and incur the expense that a given answer tuple may be
computed many times (whenever base tables overlap). A top-down evaluation of Q
from Figure 1 is the union of the eight join expressions from

{d, i} × {f, s} × {h, l}
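
The top-down expansion and the effect of discounting an unfolding can be enumerated directly. The following Python sketch is an illustration only; the names VIEWS and expand are hypothetical, and h and l are assumed abbreviations for health_ins and life_ins.

```python
from itertools import product

# Each view predicate with the base relations it unions (abbreviated names).
VIEWS = {"a": ("d", "i"), "e": ("f", "s"), "b": ("h", "l")}


def expand(body):
    """All base-level join expressions of a conjunctive body of predicates."""
    choices = [VIEWS.get(pred, (pred,)) for pred in body]
    return set(product(*choices))


Q = ("a", "e", "b")
top_down = expand(Q)                                # the eight join expressions

# Case 2: discounting F_4 = d, e, b leaves exactly the expansions of Q'.
after_case_2 = top_down - expand(("d", "e", "b"))

# Case 3: discounting F_5 = d, f, l removes a single join expression.
after_case_3 = top_down - expand(("d", "f", "l"))
```

In Case 2 the remainder coincides with the expansion of a single trimmed query, which is what makes F_4 simple. In Case 3 seven join expressions remain and no single trimmed conjunctive query over the views expands to exactly that set, which is why a more careful rewrite is needed.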

^3 We assume set semantics for answer sets in this paper.
A more efficient evaluation plan, perhaps, for the discounted query can be
devised by rewriting the query so that the number of operations (unions plus
joins) is minimized. (See Figure 2.) As a side effect of this operation, the
redundancy in join evaluation, as well as the redundancy in answer tuple
computation, is reduced [?].

Clearly, in estimating whether one query will evaluate faster than another,
many factors are more important than the mere size of the query expression.
It is quite likely that some well-chosen, larger query expression will evaluate
faster than the minimum-sized expression of the query.
However, succinctness of the query expression is an important component which 
cannot be ignored. Especially when alternative expressions can be exponentially 
larger than the minimal expression, controlling the succinctness of the query 
expression is vital. We focus on the succinctness issue in this paper. 




Fig. 2. The AND/OR tree representation of the modified query. 



As in Case 1 above, it may be that the unfoldings-to-discount (removed
sub-views) cover the view. This means that the discounted view is equivalent to
the null view (defined to evaluate empty). We present the complexity of
deciding coverage in Section ??. In Section ??, we show that there are natural
cases when the view can always be rewritten into a simpler form. This is true
whenever an unfolding-to-discount is simple, as in Case 2 above. In such cases,
the rewrite is always an algebraic optimization. In Section ??, we consider the
general case of rewriting a view in an algebraically absolute optimal way, as
we did in Case 3 and Figure 2 above. Our goal is to find good view
disassemblies; that is, rewrites that result in small AND/OR trees. We aim to
optimize over the number of nodes of the resulting AND/OR tree. (An ultimate
goal might be to find a rewrite that results in the smallest AND/OR tree
possible.) We show that the complexity of a sub-case of this task (a sub-class
of fairly simple views) is NP-complete over the size of the view’s AND/OR tree.
We show the general problem is even harder. In Section ??, we explore
approximate optimality. We motivate a rewrite algorithm

^ We have a naive algorithm that can find the optimal rewrite in the case of a single 
unfolding-to-discount. Such a result, even for this small example, would be rather 
difficult to find by hand. 
that produces a disassembled view (equivalent semantically to the discounted 
view expression) for which the complexity is over the number of unfoldings-to- 
discount, and not over the size of the view’s AND/OR tree. Hence, this approach 
is tractable, in general, and can result in rewrites that are reasonably compact. 

We do not address in this paper the issue of determining when one query 
semantically overlaps with another query; that is, we assume that the unfolding 
that represents the overlap has already been identified. This, of course, may not 
be a trivial task. For example, the queries from Example 1 might not 
have been the original cached queries themselves, but instead could have been 
constructed from them. 



2 Related Work 

The work most closely related to view disassembly is m The authors consider 
queries that involve nested union operations, and propose a technique for rewrit- 
ing such queries when it is known that some of the joins evaluated as part of 
the query are empty. The technique in HH applies, however, only to a class of 
simple queries, and no complexity issues are addressed. 

Another research area related to view disassembly is multiple query optimiza- 
tion (MQO) 1 1 ,4] . The goal in multiple query optimization is to optimize batch 
evaluation of a collection of queries, rather than just a single query. The tech- 
niques developed for MQO attempt to find and reuse common sub-expressions 
from the collection of queries, and are heuristics-based. We do not expect that 
the MQO techniques could result in the rewrites we propose in this paper. We 
can exploit the fact that our rewrites involve unfoldings that all come from the 
same view. 

The problem of query tree rewrites for the purpose of optimization has been 
also considered in the context of deductive databases with recursion. In jSj, the 
problem of detecting and eliminating redundant subgoal occurrences in proof 
trees generated by programs in the presence of functional dependencies is dis- 
cussed. In uni, the residue method of P is extended to recursive queries. 

In earlier work, we introduced a framework we call intensional query optimization, 
which enables rewrites to be applied to non-conjunctive queries and views (that 
is, ones which involve union). An initial discussion of complexity issues and 
possible algorithmic solutions appear in jSj. In [Z|, we present an algorithm which 
incorporates unfolding removal into the query evaluation procedure. Hence the 
method in 0 is not an explicit query rewrite. 

Our work in view disassembly is naturally related to all work on view and 
query rewrites. However, almost all work in view rewrites strives to find views that 
are semantically equivalent to, or contained in, the original. View disassembly 
does not. Rather, we use rewrites to remove components (unfoldings) implicitly 
from the original view. This is a much different goal than that of 
previous view rewrite work, and so it requires a different treatment. Aside 
from the work listed above, we are not aware of any work on view rewrites that 
bears directly on view disassembly. 
3 Discounting and Covers 

We define a view (and, likewise, a query) to be a set of atoms. For instance, 
{a, e, b} represents the query/view in Example 1. Some of the atoms may be 
intensional; that is, they are written with view predicates defined over base table 
predicates and, perhaps, other views. Since some of the atoms may be intensional, 
the corresponding AND/OR tree (with respect to the IDB’s rules) may involve 
both joins (ANDs) and unions (ORs). 

We provide a formal definition for an unfolding of a query. 

Definition 1. Given query sets $Q$ and $\mathcal{U}$, call $\mathcal{U}$ a 1-step unfolding of query set 
$Q$ (denoted by $\mathcal{U} \leq_1 Q$) with respect to database DB iff, given some $q_i \in Q$ and 
a rule $\langle a \leftarrow b_1, \ldots, b_n. \rangle$ in IDB such that $q_i\theta = a\theta$ (for most general unifier $\theta$), 
then 

$$\mathcal{U} = (Q - \{q_i\} \cup \{b_1, \ldots, b_n\})\theta$$

Call $\mathcal{U}_1$ simply an unfolding of $Q$, written as $\mathcal{U}_1 \leq Q$, iff there is some finite 
collection of query sets $\mathcal{U}_1, \ldots, \mathcal{U}_k$ such that $\mathcal{U}_1 \leq_1 \ldots \leq_1 \mathcal{U}_k \leq_1 Q$. 

An unfolding $\mathcal{U}$ is called extensional iff, for every $q_i \in \mathcal{U}$, atom $q_i$ is written 
with an extensional predicate. Call the unfolding intensional otherwise. 

One of the 1-step unfoldings of the query in Example 1 is {a, e, l}. One of 
the extensional unfoldings of the query is {d, f, h}. 
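
As a rough illustration of Definition 1 in the propositional setting used by the examples, the following Python sketch enumerates 1-step and extensional unfoldings. The rule table `RULES` (in which a, e, and b are each defined by two single-atom rules) and the function names `one_step_unfoldings` and `unfolds` are assumptions made for illustration; they are modeled on the running example rather than taken from it, and unification is ignored because the atoms are propositional.

```python
# Hypothetical propositional IDB: each intensional atom maps to its rule bodies.
RULES = {
    "a": [("d",), ("i",)],
    "e": [("f",), ("s",)],
    "b": [("l",), ("h",)],
}


def one_step_unfoldings(query, rules):
    """Return every query set obtained by replacing one atom with one rule body."""
    query = frozenset(query)
    result = set()
    for atom in query:
        for body in rules.get(atom, []):
            result.add((query - {atom}) | frozenset(body))
    return result


def unfolds(query, rules):
    """Return all extensional unfoldings of `query` (rules assumed non-recursive)."""
    query = frozenset(query)
    for atom in sorted(query):
        if atom in rules:
            # Unfold the first intensional atom through each of its rules.
            expanded = set()
            for body in rules[atom]:
                expanded |= unfolds((query - {atom}) | frozenset(body), rules)
            return expanded
    return {query}  # no intensional atom left: the query set is extensional


Q = {"a", "e", "b"}
print(len(one_step_unfoldings(Q, RULES)))  # number of 1-step unfoldings
print(len(unfolds(Q, RULES)))              # number of extensional unfoldings
```

Representing query sets as frozensets keeps the set semantics of the definition and lets unfoldings be stored in ordinary Python sets.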

An unfolding of a view Q can be marked in Q’s AND/OR tree. In Example 2 
below, the three unfoldings of the view Q considered are shown marked (by labels 
1, 2, and 3) in Q’s AND/OR tree in Figure 4. In essence, one can think of each 
unfolding’s own AND/OR tree as embedded in the view’s AND/OR tree. It is 
actually these embedded AND/OR trees, rather than the unfoldings themselves, 
that we seek to remove by rewriting syntactically the view’s AND/OR tree. 




Fig. 3. An AND/OR tree with repeated elements. 

® We ignore the ordering of the atoms in the query, without loss of generality. We also 
only present “propositional” examples for simplicity’s sake. It should be clear how 
the examples and techniques apply when the view’s variables are explicitly presented. 



424 Parke Godfrey and Jarek Gryz 



When the unfoldings-to-discount correspond uniquely to embedded AND/OR 
trees in the AND/OR tree of the view to be discounted (as they do in Example 2 
below), this distinction between unfoldings (a semantic notion, as are queries) 
and the embedded AND/OR trees that represent them (a syntactic notion) is 
unimportant. However, the correspondence is not always unique (and, therefore, 
not always straightforward). This can happen if the same atom appears multiple 
times in the view’s definition (and, hence, AND/OR tree). Consider the 
AND/OR tree in Figure 3. Does unfolding {a, b, . . .} correspond to the embedded 
AND/OR tree marked by (1) or that marked by (2)? Or does it correspond 
to both? 

For the sake of this paper, we avoid these ambiguities by limiting ourselves 
to a sub-class of views for which they do not arise: views for which no atom 
is employed twice in the definition. For this sub-class, the correspondence of 
unfoldings to embedded AND/OR trees is unique and straightforward. Since the 
complexity results we derive in this paper are with respect to this sub-class of 
views, they provide a legitimate lower bound on the complexity of the view 
disassembly problems for the class of all views. Certainly, a view query with 
repetition in its definition presents further opportunities for optimization via 
rewrites to remove some of the duplication, but we do not consider these rewrite 
issues here. 

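Let $Q$ be a view and $\mathcal{U}_1, \ldots, \mathcal{U}_k$ be unfoldings of $Q$. We define the notation 
$Q \backslash \{\mathcal{U}_1, \ldots, \mathcal{U}_k\}$ to be a discounted view of $Q$ with unfoldings-to-discount 
$\mathcal{U}_1, \ldots, \mathcal{U}_k$. Define $\mathit{unfolds}(Q)$ to be the set of all extensional unfoldings of $Q$. 
The meaning of $Q \backslash \{\mathcal{U}_1, \ldots, \mathcal{U}_k\}$ is intended to be: 

$$\mathit{unfolds}(Q) - \Big( \bigcup_{i=1}^{k} \mathit{unfolds}(\mathcal{U}_i) \Big)$$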

The first case we ought to consider is when the set of extensional unfoldings 
of the discounted view is empty. In such a case, we say that the unfoldings- 
to-discount — that is, the unfoldings that we effectively want to remove — cover 
the view (or query). The degenerate case is Q\{Q}. At the opposite end of 
the spectrum is Q\unfolds (Q). When a discounted view is covered, the most 
succinct disassembled view is the null view, which we define to evaluate to the 
empty answer set. Thus, we are interested in how to test when a discounted 
view is covered. As it happens, there are interesting, and unobvious, cases of 
discounted views which turn out to be covered. Furthermore, cover detection is 
computationally hard. 

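Example 2. $\mathcal{F}_1$, $\mathcal{F}_2$, and $\mathcal{F}_3$ represent three unfoldings of $Q$ in Example 1. Figure 4 
shows $\mathcal{F}_1$, $\mathcal{F}_2$, and $\mathcal{F}_3$ marked in the tree from Figure 1 of Example 1. Since 

$$\mathit{unfolds}(Q) \subseteq \bigcup_{i=1}^{3} \mathit{unfolds}(\mathcal{F}_i)$$

the set $\{\mathcal{F}_1, \mathcal{F}_2, \mathcal{F}_3\}$ is a cover of $Q$. 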

We establish that determining that a discounted view is covered is coNP- 
complete over the number of unfoldings-to-discount. For a set-theoretic version 
of this problem that is appropriate to establish computational complexity, the 
input can be considered to be the view’s AND/OR tree and the AND/OR trees 
of the unfoldings-to-discount, and the question is whether the view’s AND/OR 
tree is covered by the AND/OR trees of the unfoldings-to-discount. 

Fig. 4. Cover of the AND/OR tree in Example 2. 

Definition 2. A discounted view instance V is a pair of an AND/OR tree 
(which represents the view, and which contains no duplicate atoms) and a list of 
AND/OR trees (which correspond to the unfoldings-to-discount, and which can 
be marked uniquely as embedded AND/OR trees in the view’s AND/OR tree). 
Define COV as the set of all discounted view instances that are covered. 



Theorem 1. COV is coNP-complete. 

Proof. By reduction from 3-SAT. □ 

It is perhaps a surprising result that the complexity of deciding the cover 
question for discounted views is dictated by the number of unfoldings-to-discount, 
and not by the size of the view. This is still bad whenever one has many 
unfoldings-to-discount to consider. The good news is that the intractability is 
independent of the view definition’s complexity. Furthermore, the news for cover 
detection seems good, in practice. Often, the number of unfoldings being consid- 
ered is manageably small, so the cover check is tractable. In addition, we have em- 
pirically observed P that average case for the cover check appears tractable even 
for significantly more unfoldings-to-discount, even though worst-case is known 
to be intractable. 

Thus, the first step in view disassembly should be to check whether the 
discounted view is covered. We investigate next what can be done when it is 
not. 
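
The cover test can be made concrete with a brute-force sketch for propositional views. It enumerates every extensional unfolding, so it is exponential in the size of the view and says nothing about the complexity analysis above; it only spells out what "covered" means. The rule table `RULES`, the three unfoldings-to-discount in `COVER`, and all function names are hypothetical and chosen for illustration.

```python
# Hypothetical propositional IDB: each intensional atom maps to its rule bodies.
RULES = {
    "a": [("d",), ("i",)],
    "e": [("f",), ("s",)],
    "b": [("l",), ("h",)],
}


def unfolds(query, rules):
    """Return all extensional unfoldings of `query` (rules assumed non-recursive)."""
    query = frozenset(query)
    for atom in sorted(query):
        if atom in rules:
            expanded = set()
            for body in rules[atom]:
                expanded |= unfolds((query - {atom}) | frozenset(body), rules)
            return expanded
    return {query}


def discounted_unfoldings(query, to_discount, rules):
    """unfolds(Q) minus the union of unfolds(U_i) over the unfoldings-to-discount."""
    removed = set()
    for unfolding in to_discount:
        removed |= unfolds(unfolding, rules)
    return unfolds(query, rules) - removed


def is_covered(query, to_discount, rules):
    """A discounted view is covered iff no extensional unfolding survives."""
    return not discounted_unfoldings(query, to_discount, rules)


Q = {"a", "e", "b"}
# Three illustrative unfoldings-to-discount that together cover Q.
COVER = [{"d", "e", "b"}, {"i", "f", "b"}, {"a", "s", "b"}]
print(is_covered(Q, COVER, RULES))
print(is_covered(Q, COVER[:2], RULES))
```

Dropping the third unfolding leaves the extensional unfoldings that contain both i and s, so the first two unfoldings alone are not a cover.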



4 Simple Unfoldings 

A disassembled view may cost more to evaluate than the original view. A degen- 
erate case is, of course, the case of the disassembled view that is the collection 
of all the extensional unfoldings. In general, it cannot be guaranteed that the 
AND/OR tree for a best disassembled view would be more compact (hence would 
require fewer operations to evaluate) than the original view. (Case 3 of Example 
1 demonstrated this.) In this section, we define a type of unfolding for which 
discounting is guaranteed to produce an AND/OR tree that is more compact 
than the AND/OR tree of the original view. We call such unfoldings simple. 

The intuition behind the concept of a simple unfolding is as follows. We strive 
to define an unfolding whose discounting from a query amounts to a removal of 
one or more nodes from the query’s AND/OR tree without rewriting the rest 
of the tree. In general, removal of a node from a query tree is equivalent to 
discounting more than a single unfolding because such a node represents an 
atom from more than one unfolding. Thus, removal of a node from a query tree 
amounts to discounting all unfoldings that contain the atom represented by the 
removed node. Hence, if there is a single unfolding $\mathcal{U}$ that subsumes all unfoldings 
affected by a removal of node $N$ but no other unfoldings, we are guaranteed that 
removing $N$ from a query tree of $Q$ results in a query tree for $Q \backslash \{\mathcal{U}\}$. 

Let us define first the concept of a choice point atom. A choice point atom is 
an atom (in a query or view) that refers to a view defined with a union operator 
(at the top level). 

Definition 3. Let $Q = \{q_1, \ldots, q_n\}$ be a view. Then $q_i$ is called a choice point 
atom iff there is more than one rule $\langle a \leftarrow b_1, \ldots, b_n. \rangle$ such that $q_i\theta = a\theta$, for 
most general unifier $\theta$. 

All atoms in query $Q$ of Example 1 (that is, a, e, and b) are choice point 
atoms. 

Definition 4. Let $Q$ be a query and $\mathcal{U}_0$ be an unfolding of $Q$ such that 
$\mathcal{U}_0 \leq_1 \mathcal{U}_1 \leq_1 \ldots \leq_1 \mathcal{U}_k = Q$. Then, $\mathcal{U}_0$ is a simple unfolding of $Q$ iff, for all 
$i \leq k$, the set $\mathcal{U}_i - \mathcal{U}_0$ contains at most one choice point atom. 



Example 3. Let the query $Q$ be as in Example 1, but now assume that Departments 
is also a view defined as follows. 

$$\begin{array}{l}
d(Y, V) \leftarrow d_1(X, Y, Z),\ d_2(X, V, W). \\
d_1(X, Y, Z) \leftarrow d_{1,1}(X, Y, Z). \\
d_1(X, Y, Z) \leftarrow d_{1,2}(X, Y, Z).
\end{array}$$

The query tree for $Q$ in this new database is shown in Figure 5. 

Consider the following unfoldings of $Q$: 

$\mathcal{U}_1$: $\{d, e, b\}$ 
$\mathcal{U}_2$: $\{d_{1,1}, d_2, e, b\}$ 
$\mathcal{U}_3$: $\{d, f, b\}$ 

Unfolding $\mathcal{U}_1$ is simple, since $\mathcal{U}_1 \leq_1 Q$ and $Q - \mathcal{U}_1 = \{a\}$. Removing node d 
(with the entire subtree rooted at d) produces a tree which contains all unfoldings 
of the original query $Q$ except for the unfolding $\mathcal{U}_1$ (and all unfoldings subsumed 
by $\mathcal{U}_1$). In other words, the new tree represents the query $Q \backslash \{\mathcal{U}_1\}$. Note that 
this is clearly a case of optimization. The new tree has fewer operations than 
the original tree. 







Fig. 5. The AND/OR tree representation of the query of Example 3. 



Unfolding $\mathcal{U}_2$ is simple, since $\mathcal{U}_2 \leq_1 \mathcal{U}'_1 \leq_1 \mathcal{U}_1 \leq_1 Q$, for which 
$\mathcal{U}'_1 = \{d_1, d_2, e, b\}$, and we know $\mathcal{U}'_1 - \mathcal{U}_2 = \{d_1\}$, $\mathcal{U}_1 - \mathcal{U}_2 = \{d\}$, and 
$Q - \mathcal{U}_2 = \{a\}$. Similarly to the case discussed above, it is sufficient to remove 
node $d_{1,1}$ to produce a tree for $Q \backslash \{\mathcal{U}_2\}$; no other rewrite is necessary. 

Consider unfolding $\mathcal{U}_3$. Now, $Q - \mathcal{U}_3 = \{a, e\}$. Since both a and e are choice 
point atoms, the unfolding is not simple. This case illustrates the intuition behind 
Definition 4. Since $Q - \mathcal{U}_3 = \{a, e\}$, there must be at least two atoms in $\mathcal{U}_3$ 
(they are d and f in this case) that lie under a and e respectively. Since both a 
and e are choice point atoms, d and f each must have siblings (they are i and 
s, respectively, in this case). Consider removing d or f (or both) to generate a 
tree for $Q \backslash \{\mathcal{U}_3\}$. Removing d from the tree means that an unfolding {d, s, b} is 
also removed; removing f means that {i, f, b} is removed as well. Thus, it is not 
possible to produce a tree for $Q \backslash \{\mathcal{U}_3\}$ by simple node removal. 

It is easy to show that all 1-step unfoldings are simple. (Unfolding $\mathcal{U}_1$ in 
Example 3 is a 1-step unfolding.) 

We can now state an optimization theorem. 

Theorem 2. Let $Q$ be a query and $\mathcal{U}_0$ be a simple unfolding of $Q$. Then, an 
AND/OR tree for $Q \backslash \{\mathcal{U}_0\}$ can be produced by removing one or more nodes from 
the AND/OR tree for $Q$. 

Proof. By case analysis. □ 

Simple unfoldings are ideal when the goal of disassembly is optimization (that 
is, the discounted query or view must cost less to evaluate than the original). 
They are easy to detect (by definition) and remove (as shown for Ui above). The 
disassembled view is then guaranteed to be smaller than the original view. This 
means the disassembled view will almost always cost less to evaluate than the 
original view. It can also be shown that the class of simple unfoldings is the only 
type of unfolding that guarantees the existence of a disassembled view that has 
fewer operations than the view. 
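
The effect of discounting a simple unfolding can be checked on a small propositional view. The sketch below removes one rule (one child of an OR node) and confirms that the pruned view has exactly the extensional unfoldings of the discounted view. It uses a flat, hypothetical rule table rather than the nested Departments view of Example 3, and the names `prune_rule` and `unfolds` are illustrative assumptions.

```python
# Hypothetical propositional IDB: each intensional atom maps to its rule bodies.
RULES = {
    "a": [("d",), ("i",)],
    "e": [("f",), ("s",)],
    "b": [("l",), ("h",)],
}


def unfolds(query, rules):
    """Return all extensional unfoldings of `query` (rules assumed non-recursive)."""
    query = frozenset(query)
    for atom in sorted(query):
        if atom in rules:
            expanded = set()
            for body in rules[atom]:
                expanded |= unfolds((query - {atom}) | frozenset(body), rules)
            return expanded
    return {query}


def prune_rule(rules, head, body):
    """Return a copy of `rules` without the rule head <- body (a node removal)."""
    return {
        h: [b for b in bodies if not (h == head and tuple(b) == tuple(body))]
        for h, bodies in rules.items()
    }


Q = {"a", "e", "b"}
U1 = {"d", "e", "b"}                      # a 1-step (hence simple) unfolding of Q
PRUNED = prune_rule(RULES, "a", ("d",))   # drop node d from under a

# The pruned view should have the same extensional unfoldings as Q \ {U1}.
print(unfolds(Q, PRUNED) == unfolds(Q, RULES) - unfolds(U1, RULES))
```

No other part of the rule table is rewritten, which is the sense in which a simple unfolding is removed by pruning alone.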
5 AND/OR Tree Minimization 

When the unfoldings-to-discount are not simple, finding an AND/OR tree of the 
disassembled view requires, in general, more than just a pruning of the AND/OR 
tree of the original view. Moreover, as we stated in Section 1, one might want to 
find not just any AND/OR tree, but the most compact one; that is, the one with 
the smallest number of union and join operations with respect to all trees that 
represent the disassembled view. One way to achieve this is to take the union of 
all extensional unfoldings of the view, remove all that are extensional unfoldings 
of any unfolding-to-discount, and minimize (by distributing unions over joins) 
the number of operations in the remaining unfoldings. This procedure is clearly 
intractable since there can be an exponential number of extensional unfoldings 
with respect to the size of the view’s tree. As it is, we cannot do any better. 
Finding the most compact tree is intractable, even for very simple views. In this 
section, we consider a view which is a two-way join over unions of base tables. 
In a sense, this is the simplest class of views for which the minimization problem 
is non-trivial (in that this is the least complex view that contains non-simple 
unfoldings). 

Let the view $Q$ be: 

$$q \leftarrow a, b.$$

in which $a$ and $b$ are defined as follows: 

$$\begin{array}{ll}
a \leftarrow a_1. & b \leftarrow b_1. \\
\quad\vdots & \quad\vdots \\
a \leftarrow a_n. & b \leftarrow b_n.
\end{array}$$

Define the set of unfoldings-to-discount as follows. Let $\mathcal{U}_1 = \{a_{i_1}, b_{j_1}\}, \ldots,$ 
$\mathcal{U}_l = \{a_{i_l}, b_{j_l}\}$, for $i_k, j_k \in \{1, \ldots, n\}$, $1 \leq k \leq l$. Assume, without loss of 
generality, that for every $S \subseteq \{a_1, \ldots, a_n\}$, there is an atom (with the intensional 
predicate) $A_S$ defined as: $A_S \leftarrow a_k$ for all $a_k \in S$. We call the collection of all 
such atoms $\mathcal{A}$; that is, 

$$\mathcal{A} = \{A_S \mid S \subseteq \{a_1, \ldots, a_n\} \text{ and } A_S \leftarrow a_k, \text{ for all } a_k \in S\}$$

Similarly define $B_T$ for all subsets of $\{b_1, \ldots, b_n\}$. Then, the discounted query 
can be evaluated as a union of joins over atoms $A_S$ and $B_T$. We are interested in 
the maximal pairs of $A_S$’s and $B_T$’s such that none of the unfoldings-to-discount 
can be found in their cross. (By maximal, it is meant that no super-set of the 
chosen $A_S$ or no super-set of the $B_T$ chosen would also have this property.) That 
is, given such a pair $A_S$ and $B_T$, $a_{i_k} \notin A_S$ or $b_{j_k} \notin B_T$, for $1 \leq k \leq l$. Let $\mathcal{C}$ be 
the collection of all such maximal, consistent pairs of $A_S$’s and $B_T$’s. 

$$\mathit{unfolds}(Q \backslash \{\mathcal{U}_1, \ldots, \mathcal{U}_l\}) = \bigcup_{(A_S, B_T) \in \mathcal{C}} \mathit{unfolds}(\{A_S, B_T\})$$
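
To make the notion of maximal, consistent pairs concrete, the following sketch enumerates them by brute force for a tiny instance of the two-way join view. It tries every subset S of the a-atoms, so it is exponential in n and intended only as an illustration. The atom names, the forbidden pair, and the function names `maximal_pairs` and `crossed` are assumptions, not part of the construction above.

```python
from itertools import combinations, product


def maximal_pairs(a_atoms, b_atoms, forbidden):
    """Enumerate maximal (S, T) whose cross S x T avoids every forbidden pair."""
    a_atoms, b_atoms = list(a_atoms), list(b_atoms)
    forbidden = set(forbidden)

    def compatible_bs(s):
        # Largest T that can be joined with S without producing a forbidden pair.
        return frozenset(b for b in b_atoms
                         if all((a, b) not in forbidden for a in s))

    def compatible_as(t):
        # Largest S that can be joined with T without producing a forbidden pair.
        return frozenset(a for a in a_atoms
                         if all((a, b) not in forbidden for b in t))

    pairs = set()
    for size in range(1, len(a_atoms) + 1):
        for subset in combinations(a_atoms, size):
            s = frozenset(subset)
            t = compatible_bs(s)
            # (S, T) is maximal exactly when neither side can be enlarged.
            if t and compatible_as(t) == s:
                pairs.add((s, t))
    return pairs


def crossed(pairs):
    """All (a, b) combinations produced by the union of the joins S x T."""
    return {(a, b) for s, t in pairs for a in s for b in t}


A = ["a1", "a2", "a3"]
B = ["b1", "b2", "b3"]
FORBIDDEN = {("a1", "b1")}          # one unfolding-to-discount {a1, b1}
PAIRS = maximal_pairs(A, B, FORBIDDEN)
print(len(PAIRS))
print(crossed(PAIRS) == set(product(A, B)) - FORBIDDEN)
```

The final check compares the union of the crosses with the set of all non-discounted combinations, mirroring the equation above. Maximal pairs may overlap with one another, so this sketch does not yet address the minimization problem.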
Let $t$ be the cardinality of the collection $\mathcal{C}$. This number represents the number 
of trees that are needed to evaluate the discounted query. Let $W_i$, $1 \leq i \leq t$, 
be a query (unfolding) represented by each such tree and $\mathit{op}(W_i)$ be the total 
number (that is, a single join and all unions) of operations required to evaluate 
$W_i$. Then, 

$$\mathit{op}\Big( \bigcup_{i=1}^{t} W_i \Big) = \Big( \sum_{i=1}^{t} \mathit{op}(W_i) \Big) + t - 1.$$

We also require, without loss of generality, that the $W_i$’s do not overlap; that is, 
there does not exist an unfolding $\mathcal{U}$ such that $\mathcal{U} \leq W_i$, $\mathcal{U} \leq W_j$, and $i \neq j$. 

We can now state the problem of Minimization of Discounted Query as fol- 
lows. 

Definition 5. Define the class Minimization of Discounted Query (MDQ) as 
follows. An instance is the triplet of a query set $Q$, a collection of unfoldings-to- 
discount $\mathcal{U}$, and a positive integer $K$. An instance belongs to MDQ iff there is 
a collection of unfoldings $W_1, \ldots, W_t$ defined as above, such that 
$\mathit{op}( \bigcup_{i=1}^{t} W_i ) \leq K$. 



Theorem 3. Minimization of Discounted Query (MDQ) is NP-complete. 

Proof. By reduction from a known NP-hard problem, minimum order partition 
into bipartite cliques P]. □ 

The NP-completeness result can be trivially generalized to the case where a 
and b in the query Q are defined through different numbers of rules. 

The minimization of more complex queries does not remain NP-complete. 
Consider a query $Q = \{p_1, \ldots, p_n\}$, where each of the $p_i$’s is defined through multiple 
rules as: $\langle p_i \leftarrow p_i^j. \rangle$ for $1 \leq j \leq k_i$. Since the number of extensional unfoldings 
for this query is exponential in the size of the original query, verifying the solution 
cannot be, in general, done in polynomial time. It can be shown, however, that 
minimization of an arbitrarily complex query (that is, a query for which its 
AND/OR tree representation is arbitrarily deep, an alternation of ANDs and 
ORs) is in the class $\Pi_2^p$. 

Theorem 4. Let $Q$ be an arbitrarily complex query. Then, minimization of $Q$ 
is in $\Pi_2^p$. 

Proof. It is easy to show that minimization of a query’s operations is equivalent 
to minimization of operations in a propositional logic formula, where joins are 
mapped to conjunctions and unions are mapped to disjunctions. This problem, 
called Minimum Equivalent Expression, is known to be in $\Pi_2^p$ Pj. □ 

We conjecture that it may be complete in this class. 

® It is easy to show that the optimal tree for the two-way join view constructed above 
must be a union of two-level trees also. This means that this is a legitimate sub-class 
of the absolute optimization problem for view trees, thus the general problem is at 
least NP-hard. 
6 Approximation Solutions 

In the previous section, we showed that to find an algebraic rewrite for view 
disassembly which optimizes absolutely the number of algebraic operations (that 
is, the size of the AND/OR tree) is intractable. In this section, we investigate 
an approximation approach. The premise of the approach is to not rewrite the 
original view’s AND/OR tree, but rather to find a collection of unfoldings of the 
view which “complete” the cover with respect to the unfoldings-to-discount. 

This collection, call it $\mathcal{C}$, should have the following properties. Let $\mathcal{N}$ be the 
set of unfoldings-to-discount. 

1. $\mathcal{N} \cup \mathcal{C}$ should be a cover of the view; that is, any extensional unfolding of 
the view is also an unfolding of some unfolding in $\mathcal{N}$ or $\mathcal{C}$. 

2. No two unfoldings in $\mathcal{C}$ should overlap; that is, for $\mathcal{U}, \mathcal{V} \in \mathcal{C}$ ($\mathcal{U} \neq \mathcal{V}$), $\mathcal{U}$ and 
$\mathcal{V}$ have no unfolding in common. (Call $\mathcal{U}$ and $\mathcal{V}$ pair-wise independent in this 
case.) 

3. Set $\mathcal{C}$ should be most general: 

a. no unfolding in $\mathcal{C}$ can be refolded at all, and still preserve the above 
properties; and 

b. for any $\mathcal{U} \in \mathcal{C}$, $(\mathcal{N} \cup \mathcal{C}) - \{\mathcal{U}\}$ is not a cover of the view. 

We present an algorithm to accomplish such a rewrite called the unfold/refold 
algorithm (Algorithm 1). It works as follows. First, find an extensional unfolding 
which is not subsumed by any of the unfoldings-to-discount or any unfolding in 
the set $\mathcal{C}$ (that is, the unfoldings generated so far). The routine new_unfolding 
performs that step. (This is the co-problem of determining cover. Thus, the 
difficulty depends on the size of the collection of the unfoldings-to-discount.) 
Second, refold the unfolding (that is, find a super-unfolding) such that the super- 
unfolding does not subsume any of the unfoldings-to-discount. This is performed 
by the routine refolding. 

The entire algorithm is then repeated until the unfold step fails; that is, there 
is no such extensional unfolding, meaning that a cover has been established. On 
subsequent cycles, during the refold procedure, the unfolding is refolded only 
insofar as it does not overlap with any unfoldings already in $\mathcal{C}$ (condition 2 from 
above). In the end, parsimonious ensures that the unfolding collection $\mathcal{C}$ returned 
is minimal; that is, no member unfolding of $\mathcal{C}$ can be removed and leave $\mathcal{C}$ still 
as a cover. The call parsimonious($\mathcal{C}$) can be accomplished in $O(|\mathcal{C}|)$, using the 
results of the last call to new_unfolding which had determined that $\mathcal{C}$ was, indeed, 
a cover. 

At completion, the union of the unfoldings produced is a disassembled view. 
It is equivalent semantically to the discounted view. The set of the unfoldings 
unioned with the set of the unfoldings-to-discount is equivalent semantically to 
the original view. 

^ A version of the unfold/refold algorithm is implemented in the Carmin prototype 
0. It efficiently performs the unfold step (hence, the covers test). 
C := {} 
while new_unfolding(Q, N ∪ C, U) 
    V := refolding(U, N, C) 
    C := C ∪ {V} 
return parsimonious(C) 



Algorithm 1. Unfold/refold algorithm for view disassembly. 
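
Algorithm 1 can be sketched in Python for propositional views without repeated atoms. This is a brute-force illustration, not the implementation described in the paper: overlap and cover are tested by enumerating extensional unfoldings (exponential in the view size), refolding greedily takes the first safe step, and `parsimonious` is a simple quadratic pass. The rule table and all function signatures are assumptions.

```python
# Hypothetical propositional IDB: each intensional atom maps to its rule bodies.
RULES = {
    "a": [("d",), ("i",)],
    "e": [("f",), ("s",)],
    "b": [("l",), ("h",)],
}


def unfolds(query, rules):
    """Return all extensional unfoldings of `query` (rules assumed non-recursive)."""
    query = frozenset(query)
    for atom in sorted(query):
        if atom in rules:
            expanded = set()
            for body in rules[atom]:
                expanded |= unfolds((query - {atom}) | frozenset(body), rules)
            return expanded
    return {query}


def ext_union(unfoldings, rules):
    """Union of the extensional unfoldings of a collection of unfoldings."""
    result = set()
    for unfolding in unfoldings:
        result |= unfolds(unfolding, rules)
    return result


def new_unfolding(query, blocked, rules):
    """Pick an extensional unfolding of `query` not subsumed by `blocked`, or None."""
    remaining = unfolds(query, rules) - ext_union(blocked, rules)
    return min(remaining, key=sorted) if remaining else None


def refolding(unfolding, discount, chosen, rules):
    """Greedily replace rule bodies by heads while nothing blocked is subsumed."""
    blocked = ext_union([*discount, *chosen], rules)
    current = frozenset(unfolding)
    while True:
        candidates = [
            (current - frozenset(body)) | {head}
            for head in sorted(rules)
            for body in rules[head]
            if frozenset(body) <= current
        ]
        safe = [c for c in candidates if not (unfolds(c, rules) & blocked)]
        if not safe:
            return current
        current = safe[0]


def parsimonious(query, discount, chosen, rules):
    """Drop any chosen unfolding whose removal still leaves a cover."""
    target = unfolds(query, rules)
    kept = list(chosen)
    for unfolding in list(kept):
        rest = [u for u in kept if u != unfolding]
        if target <= ext_union([*discount, *rest], rules):
            kept = rest
    return kept


def unfold_refold(query, discount, rules):
    """Return unfoldings whose union is equivalent to query \\ discount."""
    chosen = []
    while (u := new_unfolding(query, [*discount, *chosen], rules)) is not None:
        chosen.append(refolding(u, discount, chosen, rules))
    return parsimonious(query, discount, chosen, rules)


Q = {"a", "e", "b"}
N = [{"d", "f", "l"}]
RESULT = unfold_refold(Q, N, RULES)
print([sorted(u) for u in RESULT])
```

Because the order in which extensional unfoldings are picked is arbitrary (here, lexicographic), the collection returned may differ from one produced by another selection order, while still being equivalent to the discounted view.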



Example 4. Consider again Example 1. The algorithm is initialized with C := {}. 
Assume that the first extensional unfolding to consider is V = {i, f, l}. Refolding 
V, we arrive at an unfolding {i, e, b}. The next extensional unfolding that 
does not overlap either with {i, e, b} or with the unfolding-to-discount (which 
is {d, f, l} in this case), is V = {d, f, h}. Refolding it produces an unfolding 
{d, e, h} (which is pair-wise independent with {i, e, b} in C from before). The 
last remaining extensional unfolding to consider is V = {d, s, l}. This one cannot 
be refolded any further. The AND/OR query tree representing the most-general 
unfoldings is shown in Figure 6. 




Fig. 6. The result of the unfold/refold algorithm applied to the query and the 
unfolding-to-discount of Example 1. 



Note that the resulting rewrite in the example has eleven nodes (algebraic 
operations) compared with nine nodes of the tree in Example 1, which represented 
the most compact rewrite. 

The run-time complexity of the unfold/refold algorithm is dictated by the 
new_unfolding step of each cycle. This depends on the size of the collection of 
unfoldings generated so far plus the number of unfoldings-to-discount. While this 
collection remains small, the algorithm is tractable. Only when the collection 
has grown large does the algorithm tend towards intractability. An advantage of the 
approach is that a threshold can be set, beyond which the rewrite computation 
is abandoned. On average, we expect the final cover not to be large. 

A variation on the unfold/refold algorithm can be used to find the collection 
of most general unfoldings that are covered by the unfoldings-to-discount. By 
most general, it is meant that no super-unfolding of those found is covered. 
This is a generalization of the check for cover discussed in Section 3. In the 
extreme case, if only a single most-general unfolding is found (that is, the view 
itself), then the discounted view itself is covered. 

Ideally, we would always convert the set of unfoldings-to-discount into this 
most-general form. This most-general collection of unfoldings-to-discount is 
guaranteed never to be larger than the initial collection. 
Thus, it is a better input to the unfold/refold algorithm for view disassembly. 

These techniques give us many tools for rewriting views and queries for 
a number of purposes. For instance, by finding the most-general unfoldings- 
to-discount, one also identifies all the most-general simple unfoldings that are 
entailed by the unfoldings-to-discount. They are just the simple unfoldings that 
appear in the collection. For view or query optimization, the simple unfoldings 
can be pruned from the AND/OR tree, resulting in a smaller, simpler tree to 
evaluate (as shown in Section 4). If we “remove” only the simple unfoldings, 
but not the others, we are not evaluating the discounted view, but something 
in-between the view and the discounted view. However, when our goal is 
optimization, this is acceptable. 

Example 5 . Consider removing the following six (extensional) unfoldings from 
the AND/OR tree in Figure Q {p, s,m}, {p,s,w}, {p,t,u}, {p,t,w}, {r, s,m}, 
and {r, s, w}. The most-general unfoldings implied are: $\{v_1, s, v_3\}$ and $\{p, v_2, v_3\}$. 
Both of these are simple unfoldings, so the original tree can be pruned. 

7 Conclusions and Open Issues 

In this paper, we have defined the notion of a discounted view, which is con- 
ceptually the view with some of its sub- views (unfoldings) “removed” . We have 
explored how to rewrite effectively the view into a form equivalent to a dis- 
counted view expression, thus “removing” the unfoldings-to-discount. We called 
such a rewrite a disassembled view. Disassembled views can be used for opti- 
mization, data security, and streamlining the query/answer cycle by helping to 
eliminate answers already seen. 

View disassembly, like most forms of view and query rewrites, can be 
computationally hard. We showed that finding optimal view disassembly rewrites 
is at least NP-hard. However, effective disassembled views can be found which 
are not necessarily algebraically optimal, but are compact. We explored an 
approximation approach called the unfold/refold algorithm which can result in 
compact disassembled views. The complexity of the algorithm is dictated by the 
number of unfoldings-to-discount, and not by the complexity of the view definition 
to be disassembled. Thus, we have shown there are effective tools for view disassembly. 
We also have identified a class of unfoldings we called simple unfoldings 
which can be easily removed from the view definition to result in a simpler view 
definition. This offers a powerful tool for semantic optimization of views. We also 
established how we can infer when a collection of unfoldings-to-discount cover 
the original view, meaning that the discounted view is void. This result has 
general application, and is fundamental to determine when a view is subsumed 
by a collection of views. 

There is much more work to be done on view disassembly. This includes the 
following. 

— If we are to use view disassembly for optimization, we must study how 
view disassembly can be incorporated with existing query optimization tech- 
niques. Clearly cost-based estimates should be used to help determine which 
view disassemblies appear more promising for the given view, the given 
database, and the given database system. 

— Some applications benefit by view rewrite technology and might further 
benefit by view disassembly technology. However, rewrites are not always 
necessary to accomplish the task. In 0 and in recent work, we have been 
exploring methods to evaluate discounted queries directly, with no need to al- 
gebraically rewrite the query. We have seen with initial experimental results 
that a method we call tuple tagging tends to evaluate discounted queries 
less expensively than the evaluation of the queries themselves. Thus, it is 
necessary to determine which applications need rewrites, and which can be 
handled by other means. 

— The basic approach of the unfold/refold algorithm needs to be extended for 
the additional computations discussed in Section 6. It is also possible that 
a combined approach of the unfold/refold algorithm with other view rewrite 
procedures might provide generally better performance. 

Sometimes if the original view tree were syntactically rewritten in some 
given, semantically preserving way, a candidate unfolding could be removed 
easily, while it cannot be “easily” removed with respect to the original tree 
as is. We should study how view disassembly could be combined effectively 
with other semantically preserving view rewrite procedures to enable such 
rewrites. 

— A yet better understanding of the profile of view disassembly complexity, also 
with respect to other view rewrite techniques, would allow us to build per- 
haps better view disassembly algorithms. Empirical use of the unfold/refold 
algorithm, for instance, in data warehousing environments, might also pro- 
vide more insight. 
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Abstract. We consider the problem of answering datalog queries using 
materialized views. More specifically, queries are rewritten to refer to 
views instead of the base relations over which the queries were originally 
written. Much work has been done on program rewriting that produces 
an equivalent query. In the context of information integration, though, 
the importance of using views to infer as many answers as possible has 
been pointed out. Formally, the problem is: Given a datalog program V, 
is there a datalog program Vv which uses only views as EDB predicates 
and (i) produces a subset of the answers that V produces and (ii) any 
other program V'v over the views with property (i) is contained in Vv? 
In this paper we investigate the problem in the case of disjunctive view 
definitions. 



1 Introduction 

A considerable amount of recent work has focused on using materialized views 
to answer queries PI Q El o 03 uni i] This issue may arise in several situa- 
tions, e.g., if the relations mentioned in the query are not actually stored or are 
impossible to consult or are very costly to access. The ability to use views is 
important in many applications, including information integration, where infor- 
mation sources are considered to store materialized views over a global database 
schema p iTTUrT^ . 

Suppose that we want to integrate three databases that provide flight 
information. These three databases can be seen as views over the EDB 
predicates international_flight(X, Y) and local_flight(X, Y) (which have the 
meaning that there is a non-stop international flight from a Greek city X to 
city Y and that there is a non-stop local flight from a Greek city X to a 
Greek city Y, respectively): 
v1(X) :- international_flight(Athens, X) 
v1(X) :- international_flight(Rhodes, X) 

v2(X, Y) :- local_flight(X, Athens), local_flight(Athens, Y) 

v3(X) :- local_flight(X, Rhodes) 

Suppose that the user is interested in whether there is a way to fly from a 
Greek city X to a city Y in another country, making at most one intermediate 
connection in a Greek city and then a direct international flight, i.e., the 
following query V is asked: 

p(X, Y) :- international_flight(X, Y) 

p(X, Y) :- local_flight(X, Z), international_flight(Z, Y) 

The user does not have access to the databases that contain the facts in the 
predicates international_flight and local_flight but only to the views. Using 
only the materialized views v1, v2 and v3, it is not possible to find all the 
answers. The best one can do is to retrieve all the flights that use either 
the international airport of Athens or the international airport of Rhodes, 
i.e., ask the following query Vv on the available data: 

p(Athens, Y) :- v2(W, Rhodes), v1(Y) 
p(Rhodes, Y) :- v2(Rhodes, W), v1(Y) 
p(Athens, Y) :- v3(Athens), v1(Y) 
p(X, Y) :- v3(X), v2(X, W), v1(Y) 

In fact, the program Vv, which is said to be a retrievable program (i.e., it 
contains only the view predicates as EDB predicates), is maximal in the sense 
that any datalog program which uses only the views v1, v2, v3 as EDB predicates 
and produces only correct answers is contained in Vv. Such a program is said to 
be a retrievable maximally contained program. 

The problem that we consider here is : Given a datalog query and views de- 
fined over its EDB predicates, is there a retrievable maximally contained datalog 
program? 

Previous work on this problem is done in min] where they restricted views to 
being defined by conjunctive queries, whereas in | 3 |Sj disjunctive view definitions 
are considered and is shown how to express a retrievable maximally contained 
program in disjunctive datalog with inequality. In fti® problem of computing 
correct (called certain) answers is considered, also, in the general case where 
views are defined by datalog programs; they investigate the data complexity 
problem under both the closed world assumption and the open world assumption. 

In this paper, we investigate the case where views have disjunctions in their 
definition. We prove the following results: 

a) In the case where both the query and the views are given by non-recursive 
datalog programs, we identify a non-trivial class of instances of the problem 
where there exists a retrievable datalog(≠) program maximally contained in the 
query; in this case, we give an algorithm to obtain it; this program computes 
all correct answers. We know, though, that, in general such a program does not 
exist p. 
b) When the views are defined by recursive datalog programs and the query 
by a non-recursive datalog program, we reduce the problem to that of 
non-recursive view definitions. 

c) In the case the query is given by a recursive datalog program and the views 
by non-recursive datalog programs, we construct a simple disjunctive logic pro- 
gram, which, applied on a view instance, computes all correct answers whenever 
it halts. This result is also obtained in under a similar technique. We also 
show how, in certain cases, this disjunctive program can be transformed into a 
datalog(≠) program, thus obtaining the results mentioned in (a) above. 

d) Finally, when both the views and the query are given by recursive 
programs, we prove that it is undecidable whether there exists a non-empty 
retrievable program Vv. Still, we give a criterion that gives a negative 
answer in some special cases. In fact, the undecidability result holds even in 
the case of simple 
chain programs m- 

Other related work is done in 0 El where the approaches used es- 

sentially look for rewritings producing equivalent queries. It is known US! that 
the problem of finding a rewriting that produces an equivalent program is NP- 
complete, in the case, where both the query and the views are defined by disjunc- 
tions of conjunctive queries. Other work concern conjunctive queries (the views 
and the question asked) and include [El E| where they give nondeterministic 
polynomial time algorithms to find either a single retrievable query equivalent 
to the given query or a set of retrievable queries whose union is contained in 
the given query Q and that contains any other retrievable datalog query that is 
contained in Q. Special classes of conjunctive queries are examined in 0. 

2 Preliminaries 

2.1 Logic Programs and Datalog 

A disjunctive clause C is a formula of the form US! 

A_1 ∨ ... ∨ A_m :- B_1, ..., B_n 

where m ≥ 1, n ≥ 0 and A_1, ..., A_m, B_1, ..., B_n are atoms. If n = 0, C is 
called a positive disjunctive clause. If m = 1, C is called a definite clause 
or Horn clause. A definite/disjunctive program is a finite set of 
definite/disjunctive clauses. A Datalog program is a set of function free Horn 
clauses. A Datalog(≠) program is a set of function free Horn clauses whose 
bodies are allowed to contain atoms whose predicate is the built-in inequality 
predicate (≠). The left hand side of a rule is called the head of the rule and 
the right hand side is the body of the rule. A predicate is an intensional 
database predicate, or IDB predicate, in a program V if it appears at the head 
of a rule in V; otherwise it is an extensional database (EDB) predicate. A 
conjunctive query is a single non-recursive function-free Horn rule. 

Let D be a finite set. A database over domain D is a finite relational structure 
𝒟 = (D, r_1, ..., r_n), where each r_i is a relation over D; a substructure of a 
database is a relational structure (D', r'_1, ..., r'_n), where each r'_i 
contains a subset of the facts contained in r_i. 

A query is a function from databases (of some fixed type) to relations of fixed 
arity over the same domain; it has to be generic, i.e., invariant under renamings 
of the domain. 

Datalog programs may be viewed as a declarative query language with the 
following semantics. Let 𝒟 be a database, thought of as a collection of facts about 
the EDB predicates of a program V. Let Q^k_{V,p}(𝒟) be the collection of facts about 
an IDB predicate p that can be deduced from 𝒟 by at most k applications of 
the rules in V. If we consider p the goal predicate (or output predicate), then V 
expresses a query Q_{V,p}, where 

$$Q_{V,p}(\mathcal{D}) = \bigcup_{k \ge 0} Q^k_{V,p}(\mathcal{D})$$

We will sometimes write Q_V or V_p(𝒟) instead of Q_{V,p}. 

The above definition also gives a particular algorithm to compute Q_{V,p}: 
initialize the IDB predicates to be empty, and repeatedly apply the rules to add 
tuples to the IDB predicates, until no new tuples can be added. This is known as 
bottom-up evaluation. A derivation tree is a tree depicting the bottom up eval- 
uation of a specific derived fact in the IDB. Its nodes are labeled by facts and 
each node with its children correspond to an instantiation of a rule in V CHI. 
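
As an illustration of the bottom-up evaluation just described, the following is a minimal sketch and not part of the original development. It assumes a hypothetical encoding in which an atom is a pair (predicate, argument tuple), a rule is a pair (head, list of body atoms), and variables are strings starting with an upper-case letter; the names `match`, `fire` and `evaluate` are introduced here only for the example.

```python
def is_var(term):
    # Assumed convention: variables are capitalised strings, constants are not.
    return isinstance(term, str) and term[:1].isupper()

def match(atom, fact, subst):
    """Extend `subst` so that `atom` becomes the ground `fact`, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst

def fire(rule, facts):
    """All head instances obtained by one application of `rule` to `facts`."""
    head, body = rule
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for f in facts
                  if (s2 := match(atom, f, s)) is not None]
    return {(head[0], tuple(s.get(t, t) for t in head[1])) for s in substs}

def evaluate(rules, edb):
    """Naive bottom-up evaluation: apply all rules until no new fact appears."""
    facts = set(edb)
    while True:
        new = set().union(*(fire(r, facts) for r in rules)) - facts
        if not new:
            return facts
        facts |= new

# Usage: transitive closure of a two-edge graph.
tc_rules = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))]),
]
edges = {("edge", ("a", "b")), ("edge", ("b", "c"))}
paths = {args for pred, args in evaluate(tc_rules, edges) if pred == "path"}
```

Each pass of the loop corresponds to one more application of the rules, so the facts present after k passes play the role of the k-th approximation in the definition above.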

Given a datalog program, we can define a dependency graph, whose nodes 
are the predicate names appearing in the rules. There is an edge from the node 
of predicate p to the node of predicate p' if p' appears in the body of a rule 
whose head predicate is p. The program is recursive if there is a cycle in the 
dependency graph. We define, also, the adorned dependency graph as follows: We 
assign labels (possibly none or more than one) on each edge: Suppose there is a 
rule with head predicate symbol p and predicate symbol p' appears somewhere 
in the body; suppose that the i-th argument of p is the same variable as the j-th 
argument of p'. Then, on edge e = (p, p') we assign a label l = i → j; we call the 
pair (e, l) a p-pair. We define a perfect cycle in the adorned dependency graph to 
be a sequence of p-pairs (e_1, l_1), ..., (e_k, l_k) such that the edges e_1, ..., e_k 
form a cycle and, moreover, the following holds: for all q = 1, ..., k − 1, if 
l_q = i → j then l_{q+1} = j → j' for some j', and, if l_k = i → j, then l_1 = j → j' 
for some j'. We say that a datalog program has no persistencies iff there is no 
perfect cycle in the adorned dependency graph. 
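
The recursion test on the (unadorned) dependency graph can be sketched as follows. This is only an illustration under an assumed encoding (a rule is a pair of a head atom and a list of body atoms, an atom is a pair of predicate name and argument tuple); the function names are hypothetical, and the adorned graph and perfect cycles are not covered.

```python
def dependency_graph(rules):
    """Edge p -> p' whenever p' occurs in the body of a rule with head predicate p."""
    graph = {}
    for (head_pred, _), body in rules:
        graph.setdefault(head_pred, set())
        for pred, _ in body:
            graph[head_pred].add(pred)
            graph.setdefault(pred, set())
    return graph

def is_recursive(rules):
    """A program is recursive iff its dependency graph contains a cycle."""
    graph = dependency_graph(rules)
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node):
        # Depth-first search; reaching a GREY node means we closed a cycle.
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY or (colour[nxt] == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)

recursive_program = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))]),
]
flat_program = [
    (("p", ("X",)), [("e", ("X", "Y")), ("r", ("Y",))]),
]
```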

Given any datalog program V, we can unwind the rules several times and 
produce a conjunctive query (over the EDB predicates of V) generated by V. 
Program V can, therefore, be viewed as being equivalent to an infinite disjunction 
of all these conjunctive queries. Considering a conjunctive query, we freeze the 
body of the query by turning each of the subgoals into facts in a database; we 
obtain, thus, the canonical database of this query. The constants in the canonical 
database that correspond to variables that also appear in the head of the query 
are conveniently called head constants. If all EDB predicates are binary, we may 
view a canonical database as a labeled graph. Correspondingly, we can refer to 
a canonical database generated by a datalog program. We say that a datalog 
program is connected iff all head variables of the program rules occur also in 
the bodies of the corresponding rules and any canonical database (viewed as a 
hypergraph) is connected. 

A containment mapping from a conjunctive query Q_1 to a conjunctive query 
Q_2 is a function, h, that maps all variables of Q_1 on variables of Q_2, so that 
whenever an atom p(X_1, X_2, ...) occurs in Q_1, then an atom p(h(X_1), h(X_2), ...) 
occurs in Q_2. 
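
A containment mapping can be searched for by simple backtracking. The sketch below is illustrative only and rests on assumptions not stated in the text: a query is encoded as a pair (head variables, list of body atoms), all terms are variables, and the mapping is additionally required to send the head of the first query to the head of the second, as is customary.

```python
def containment_mapping(q1, q2):
    """Return a mapping h from the variables of q1 to those of q2, or None."""
    (head1, body1), (head2, body2) = q1, q2

    def extend(h, args1, args2):
        # Try to add the bindings args1[i] -> args2[i] to h consistently.
        h = dict(h)
        for a, b in zip(args1, args2):
            if h.setdefault(a, b) != b:
                return None
        return h

    if len(head1) != len(head2):
        return None
    start = extend({}, head1, head2)
    if start is None:
        return None

    def search(i, h):
        if i == len(body1):
            return h
        pred, args = body1[i]
        for pred2, args2 in body2:
            if pred2 == pred and len(args2) == len(args):
                h2 = extend(h, args, args2)
                if h2 is not None:
                    found = search(i + 1, h2)
                    if found is not None:
                        return found
        return None

    return search(0, start)

# q1: p(X) :- e(X, Y), e(Y, Z)        q2: p(A) :- e(A, A)
q1 = (("X",), [("e", ("X", "Y")), ("e", ("Y", "Z"))])
q2 = (("A",), [("e", ("A", "A"))])
h = containment_mapping(q1, q2)        # maps X, Y and Z to A
no_mapping = containment_mapping(q2, q1)
```

In the example a mapping exists from q1 to q2 but not in the other direction, since q1 contains no atom of the form e(t, t).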

The following are technical lemmata used in the proof of the main results. 

Lemma 1. Let V be a Datalog program and c a conjunctive query. There exists 
a positive integer N depending only on the sizes of the program and the query so 
that the following holds: If there is a conjunctive query c_1 generated by V such 
that there is a containment mapping from c to c_1, then there is a conjunctive 
query c_2 generated by V, of size < N, such that there is a containment mapping 
from c to c_2. 

Proof. First, we note, that, given a sufficiently large conjunctive query gener- 
ated by a datalog program, we can apply a pumping on it, obtaining a shorter 
conjunctive query generated by the same program (see 0). 

Suppose c_1 is of size > N. The containment mapping from c to c_1 maps c on 
at most |c| constants in c_1. Thus, there are > N − |c| constants in c_1 that are not 
involved in this mapping. Consider this “free” part of c_1 and apply a pumping to it, 
obtaining, thus, a shorter conjunctive query on which the containment mapping 
is preserved. □ 



Definition 1. Consider a database. A neighborhood of size k in the database 
is a connected substructure of size k. If we view the database as the canoni- 
cal database of a conjunctive query, then, we can refer to neighborhoods in a 
conjunctive query accordingly. 



Lemma 2. Suppose two conjunctive queries Q_1, Q_2 are such that each 
neighborhood of size up to k appears in one query iff it appears in the other. 
Then, for any conjunctive query Q of size up to k, there is a containment 
mapping from Q to the query Q_1 iff there is a containment mapping from Q to Q_2. 

2.2 Retrievable Programs 

Let V be any datalog program over EDB predicates e_1, e_2, ..., e_m and with 
output predicate q. Let v_1, v_2, ..., v_n be views defined by datalog programs over 
EDB predicates e_1, e_2, ..., e_m; call Vv the union of programs that define the 
views. We want to answer the query Q_V assuming that we do not have access 
to database relations e_1, e_2, ..., e_m, but only to views v_1, ..., v_n; we need a 
“certain” answer no matter which is the database that yielded the given view 
instance. 

Definition 2. A certain answer to query Q_V, given a view instance 𝒟_v, is a 
tuple t such that: for each database 𝒟 over e_1, e_2, ..., e_m with 𝒟_v ⊆ Vv(𝒟), t 
belongs to Q_V(𝒟). 

The above definition assumes the open world assumption; under the closed 
world assumption, the definition requires that 𝒟_v = Vv(𝒟); to compute a certain 

answer in this case is harder (see [Q). 

Let Vv be a datalog program over EDB predicates v_1, v_2, ..., v_n. We say that 
Vv is a retrievable program contained in V if, for each view instance 𝒟_v, all 
tuples in Vv(𝒟_v) are certain answers. 

Example 1. Let views v1, v2 be defined over the EDB predicates e1, e2, e3, e4: 

v1(X, Y) :- e1(X, Y) 
v1(X, Y) :- e1(X, Z), e2(Z, Y) 

v2(X, Y) :- e3(X, Y), e4(Y) 

and the query V by: 

p(X) :- e3(Y, X), e1(X, Z) 

Then a retrievable program Vv maximally contained in V is: 

pv(X) :- v2(Y, X), v1(X, Z) 

If Vv is applied on an instance of v1, v2, it will produce a set of answers; this 
is a subset of the answers which will be produced if query V is applied on some 
instance of the relations e1, e2, e3, e4 which yields the given instance of v1, v2. For 
example, suppose v1 = {(a, b), (b, c), (b, d), (c, f)} and v2 = {(d, a)}. Then Vv will 
produce the set of answers {(a)}. A database that yields the above view instance 
could have been either 𝒟_1 = {e1(a, b), e1(b, c), e1(c, f), e2(c, d), e3(d, a), e3(b, c), 
e4(a)} or 𝒟_2 = {e1(a, b), e1(b, d), e1(c, f), e2(d, c), e2(f, g), e3(d, a), e3(d, b), 
e4(a), e4(c), e4(b)} (or infinitely many other alternatives). If we apply V on 
𝒟_1, we get the answers {(a), (c)}, whereas if we apply V on 𝒟_2 we get the 
answers {(a), (b)}; tuple (a) is the only answer common to both. It turns out that, 
for any view instance, program Vv provably derives exactly those answers that 
are common to all relational structures over e1, e2, e3, e4 that yield this view 
instance. Note that if the programs that define the views are applied on 𝒟_2, 
then more tuples will be derived than are already in the view instance, namely 
v1(c, g) and v2(d, b). This is called query answering under the open world 
assumption, i.e., the available data in the views is assumed to be a subset 
of the data derived by the view definition programs if they are applied on the 
non-accessible relations e1, e2, e3, e4. □ 
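
The example can be replayed mechanically. The sketch below is an illustration under assumptions: the facts are one reading of the two databases of the example, binary relations are encoded as Python sets of pairs, the unary relation e4 as a set of constants, and the function names are hypothetical.

```python
def views(db):
    """Apply the view definitions of the example to a database."""
    e1, e2, e3, e4 = db["e1"], db["e2"], db["e3"], db["e4"]
    # v1(X, Y) :- e1(X, Y)    and    v1(X, Y) :- e1(X, Z), e2(Z, Y)
    v1 = set(e1) | {(x, y) for (x, z) in e1 for (z2, y) in e2 if z == z2}
    # v2(X, Y) :- e3(X, Y), e4(Y)
    v2 = {(x, y) for (x, y) in e3 if y in e4}
    return v1, v2

def query(db):
    # p(X) :- e3(Y, X), e1(X, Z)
    return {x for (_, x) in db["e3"] for (x2, _) in db["e1"] if x == x2}

def rewriting(v1, v2):
    # pv(X) :- v2(Y, X), v1(X, Z)
    return {x for (_, x) in v2 for (x2, _) in v1 if x == x2}

V1 = {("a", "b"), ("b", "c"), ("b", "d"), ("c", "f")}   # given instance of v1
V2 = {("d", "a")}                                         # given instance of v2

D1 = {"e1": {("a", "b"), ("b", "c"), ("c", "f")}, "e2": {("c", "d")},
      "e3": {("d", "a"), ("b", "c")}, "e4": {"a"}}
D2 = {"e1": {("a", "b"), ("b", "d"), ("c", "f")}, "e2": {("d", "c"), ("f", "g")},
      "e3": {("d", "a"), ("d", "b")}, "e4": {"a", "c", "b"}}

answers_from_views = rewriting(V1, V2)
common_answers = query(D1) & query(D2)
```

Under this reading, `D1` reproduces the view instance exactly, `D2` yields a strict superset of it (open world assumption), and the answers computed from the views coincide with the answers common to both databases.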

We investigate the following problem: Given a datalog program V and 
materialized views v_1, v_2, ..., v_n, construct a datalog program Vv with the 
following properties: 

(i) Vv is a retrievable program contained in V. 

(ii) Every datalog program that satisfies condition (i) is contained in Vv. 

When both conditions (i) and (ii) hold, we say that Vv is a retrievable program 
maximally contained in V. 

A few technical definitions: Given a view instance 𝒟_v, we define an expansion 
of 𝒟_v to be any database (over e_1, ..., e_m) that results from 𝒟_v after replacing 
the view facts with facts that are implied by one of the conjunctive queries 
produced from the view definitions (i.e., assuming the rules contain only EDB 
predicates in their bodies, we unify each tuple from 𝒟_v with the head of some 
rule, freeze the rest of the variables in the body of the rule and replace the 
view fact by the facts in the body). We denote an expansion of 𝒟_v by R(𝒟_v). 
E.g., in Example 1, if 𝒟_v = {v1(a, b), v2(b, c)}, then an expansion is R(𝒟_v) = 
{e1(a, 1), e2(1, b), e3(b, c), e4(c)}. 



3 Retrievable Disjunctive Programs 

Let V be any datalog program with output predicate q and v_1, v_2, ..., v_n be 
views given by non-recursive datalog programs over the predicates of V. We will 
construct in the following a disjunctive logic program Vdisj and we will show 
that it is maximally contained in V. 

In fact, we construct a program Vdisj which differs from being a datalog 
program in that it might contain disjunctions and function symbols in the head 
of some rules (which we call V⁻¹-rules). 

The construction is easy: Vdisj contains: (i) all the rules of V except the 
ones that contain as subgoals EDB predicates which do not appear in any of 
the programs of the views and (ii) for each view v_i, we complete its definition, 
i.e., we construct a collection of (possibly) disjunctive clauses as follows: We 
rewrite the definition of the view v_i as a disjunction of conjunctive queries. 
For this we rewrite each rule by renaming the head variables and possibly by 
introducing equalities in the bodies of the rules so that all the rules of the view 
definition have the same atom v_i(X_1, ..., X_m) in their heads. Then, we replace 
all occurrences of each existential variable by a function (Skolem function) of 
the form f(X_1, ..., X_m), where X_1, ..., X_m are all variables in the head of the 
rule and f is a function symbol such that for each existential variable we use 
a different function symbol. We then take the disjunction of the bodies of the 
rules and rewrite it as a conjunction of disjuncts. For each disjunct, we create 
a new rule with this disjunct in the head and the atom v_i(X_1, ..., X_m) in the 
body. Finally, all equalities are moved to the body as inequalities. The clauses 
obtained by applying this process to all view definitions are called V⁻¹-rules. 

Example 2. Assume that there are two materialized views v1 and v2 available, 
defined as follows: 

v1(X, Z) :- e(X, Y), r(Y, Z) 

v2(X, X) :- e(X, X) 

v2(X, Z) :- r(X, Y), e(Y, Z) 

where e and r are EDB predicates. 

We construct the V⁻¹-rules as follows: By completing the definition of v1 we 
get: 

v1(X, Z) → ∃Y [e(X, Y) ∧ r(Y, Z)] 

We introduce a Skolem function in order to eliminate the existential variable Y. 
We get 

v1(X, Z) → e(X, f(X, Z)) ∧ r(f(X, Z), Z) 

From this we obtain the following clauses 

e(X, f(X, Z)) :- v1(X, Z) 

r(f(X, Z), Z) :- v1(X, Z) 

Now we complete the definition of v2 

v2(X, Z) → [Z = X ∧ e(X, X)] ∨ ∃Y [r(X, Y) ∧ e(Y, Z)] 

and introduce a Skolem function in order to eliminate the existential variable Y. 
We get 

v2(X, Z) → [Z = X ∧ e(X, X)] ∨ [r(X, g(X, Z)) ∧ e(g(X, Z), Z)] 

Transforming the right hand side into conjunctive normal form, we get 

Z = X ∨ r(X, g(X, Z)) :- v2(X, Z) 
Z = X ∨ e(g(X, Z), Z) :- v2(X, Z) 
e(X, X) ∨ r(X, g(X, Z)) :- v2(X, Z) 
e(X, X) ∨ e(g(X, Z), Z) :- v2(X, Z) 

Moving equalities to the right hand side, we finally get the following 
V⁻¹-rules: 

r(X, g(X, Z)) :- v2(X, Z), Z ≠ X 
e(g(X, Z), Z) :- v2(X, Z), Z ≠ X 
e(X, X) ∨ r(X, g(X, Z)) :- v2(X, Z) 
e(X, X) ∨ e(g(X, Z), Z) :- v2(X, Z) 

□ 
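
For a view defined by a single conjunctive rule (such as v1 in the example), the Skolemization step can be sketched as below. This covers only the non-disjunctive case; the encoding of atoms as (predicate, argument tuple) pairs, the function name `inverse_rules`, and the Skolem symbol names of the form f_Y are assumptions made for the illustration.

```python
def inverse_rules(view_name, head_vars, body):
    """Inverse rules for a view given by one conjunctive rule.

    Every existential variable (a body variable missing from the head) is
    replaced by a Skolem term over all head variables; each body atom then
    becomes the head of a rule whose body is the view atom.
    """
    head_vars = tuple(head_vars)
    existential = sorted({t for _, args in body for t in args} - set(head_vars))
    # One distinct (hypothetically named) function symbol per existential variable.
    skolem = {y: ("f_" + y, head_vars) for y in existential}
    view_atom = (view_name, head_vars)
    return [((pred, tuple(skolem.get(t, t) for t in args)), [view_atom])
            for pred, args in body]

# v1(X, Z) :- e(X, Y), r(Y, Z)
rules_v1 = inverse_rules("v1", ("X", "Z"),
                         [("e", ("X", "Y")), ("r", ("Y", "Z"))])
```

The two resulting rules correspond to e(X, f(X, Z)) :- v1(X, Z) and r(f(X, Z), Z) :- v1(X, Z), with the Skolem term written as the nested tuple ("f_Y", ("X", "Z")).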

We view Vdisj as a program over the EDB predicates v_1, v_2, ..., v_n. The 
computation of program Vdisj applied on a database 𝒟 is considered in the 
usual bottom up fashion only that now a) it computes disjunctions of facts US! 
and b) by firing a datalog rule we mean: either we unify each subgoal with an 
already derived fact or we unify the subgoal with a disjunct in an already derived 
disjunction of facts; in the latter case the rest of the disjuncts will appear in the 
newly derived disjunctive fact together with the head literal and other disjuncts 
resulting possibly from other subgoals. (Observe that V⁻¹-rules will be used only 
in the first step of the bottom up evaluation.) Derivation trees for this bottom 
up evaluation are defined in the usual way only that the labels on the nodes 
are disjunctions of facts. In the following, we will refer to a derivation tree with 
nodes labeled by either facts or disjunctions of facts as a disjunctive derivation 
tree. 

We will refer to the output database of program Vdisj applied on an input 
database, meaning the database that contains only the atomic facts without 
function symbols computed by Vdisj. 

We say that program Vdisj is contained in (equivalent to, respectively) a 
datalog program V iff for any input database, the output database computed by 
Vdisj is contained in (is equal to, respectively) the output database computed 
by V. If every disjunctive datalog program contained in V is also contained in 
Vdisj, we say that Vdisj is a retrievable disjunctive datalog program maximally 
contained in V. 

Theorem 1. Let V be any datalog program and v_1, ..., v_n be views given by 
non-recursive datalog programs over the EDB predicates of V; let Vdisj be the 
disjunctive program constructed from V and the views. Then, program Vdisj is 
a retrievable disjunctive datalog(≠) program maximally contained in V. 

Proof. (Sketch) The program Vdisj contains (i) rules of V without view subgoals 
and (ii) non-recursive V⁻¹-rules. Therefore, it can be equivalently thought of as 
program V applied on databases having disjunctions as facts (i.e., those facts 
that are derived by V⁻¹-rules). 

One direction is easy: Vdisj uses resolution in the bottom up evaluation we 
described, therefore it computes all the logical consequences of V applied on a 
particular disjunctive database, namely the one containing the disjunctive facts 
over EDB predicates of V implied by the view facts and their definitions in terms 
of the EDB predicates of V . Hence Vdisj is contained in any retrievable program 
maximally contained in V . 

For the other direction, we argue as follows: Let Vv be a maximally 
contained retrievable datalog program. Consider an arbitrary database 𝒟 over the 
EDB predicates v_1, ..., v_n of Vv. Let q(a) be a fact computed by Vv. Consider all 
possible expansions of 𝒟 derived by replacing a view fact by a collection of facts 
derived from one of the conjunctions which define this view. Because Vv is 
contained in V, for any expansion database, fact q(a) is computed by V; therefore, 
for each expansion database, there is at least one derivation tree which derives 
q(a). Let T be an arbitrary collection containing at least one such derivation tree 
for each expansion database. It suffices to prove that there is a finite derivation 
tree of Vdisj that computes q(a) in 𝒟. 

We conveniently define the merging of two (disjunctive) derivation trees, T_1 
and T_2: We choose a leaf of T_2. We identify this leaf with the root of T_1 and label 
this node by the disjunction of the two labels. The rest of the nodes are left as 
they are (in fact, we hang tree T_1 from T_2), except that some of the labels are 
changed: The label in the root of T_1 appears in the disjunction in every node of 
the path from the particular leaf to the root of T_2. The label of the leaf of T_1 
appears in the disjunction in every node on a number of paths (arbitrarily chosen) 
each leading from the root to a leaf of T_1. 




444 Foto N. Afrati, Manolis Gergatsoulis, and Theodores Kavalieros 

The rest of the proof involves a combinatorial argument to show the following 
lemma: 

Lemma 3. There is a collection T of derivation trees as defined above, such 
that the following holds: Considering as initialization set the set of derivation 
trees T, there is a sequence of mergings that produces a disjunctive derivation 
tree of program Vdisj which computes q(a) in 𝒟. 

□ 

4 Queries and Views Defined by Non-recursive Programs 

In this section we consider the case of non-recursive query datalog programs and 
views defined by datalog programs without recursion as well. 

Theorem 2. Suppose that the program defining the query is a non-recursive 
connected datalog program and the programs defining the views are all non- 
recursive datalog programs. Suppose that there exists a retrievable datalog pro- 
gram maximally contained in the query, which has no persistencies. Then there 
exists a retrievable datalog program maximally contained in the query which is 
non-recursive. 

Proof. (Sketch): Let V be the program defining the query and suppose there 
is a retrievable datalog program maximally contained in the query, call it Vv. 
Suppose Vv is recursive. Consider any canonical database of Vv and suppose 
𝒟 is an expansion of this canonical database. Consider the conjunctive query 
generated by V, say Q, such that there is a containment mapping from Q to 𝒟 
(𝒟 being viewed as a conjunctive query). Since Vv has no persistencies, any 
canonical database, and hence 𝒟 as well, is of low degree (at most a function of 
the size of the program). Moreover Q is connected; consequently, the image of Q 
in 𝒟 may involve only a limited number (at most a function of the sizes of the 
programs) of constants of the domain of 𝒟. Therefore Vv is contained in a 
retrievable program which itself is contained in the query, a contradiction. □ 

We, next, present a procedure which produces a retrievable program, when- 
ever the instance of the problem obeys certain conditions. We show that if the 
procedure reaches a datalog program then this is a retrievable program maxi- 
mally contained in the query. 

Our procedure proceeds in two steps: a) From Vdisj we try to obtain a Horn 
program VHorn which might contain function symbols and b) in the case we 
find a Horn program in the previous step, we eliminate from VHorn the function 
symbols, deriving the final program Vv. The elimination of function symbols is 
done in a bottom up fashion as in HUI2I. Note that in step (a) we might not 
obtain a Horn program. 

In the first step, we try to obtain program VHorn by applying program 
transformation rules to Vdisj. The basic transformation rule that we use is unfolding 
pI1 |. Unfolding in disjunctive logic programs is an extension of the unfolding in 
Horn clause programs. In general, the application of the unfolding rule consists 
of a sequence of elementary unfolding steps. Suppose that we have to unfold 
(elementary unfolding) a clause C at a body atom B using a clause D at a head 
atom B'. Suppose that θ is the most general unifier of B and B'. Then we get 
a clause whose body is the body of C after replacing the atom B by the body 
of D and whose head is the disjunction of the head of C with the head atoms 
of D except B'. In this clause we apply the unifier θ. Now in order to unfold a 
clause C at a chosen body atom B, we have to use all program clauses which 
have a head atom unifiable with B. If a clause has more than one such head atom, 
we use the clause in all possible ways. Moreover, in the latter case we have also to 
unfold the initial clause using the clauses obtained by the previous elementary 
unfolding steps if they also have head atoms unifiable with B, and so on. The set 
of clauses obtained in this way, denoted by unfold(C, B), may replace clause C 
in the program. For more details about the unfolding operation, see mg. 
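
A single elementary unfolding step can be illustrated on ground clauses, where the most general unifier is the identity and can be ignored. The sketch below is a simplification under that assumption; a clause is encoded as a pair (set of head atoms, set of body atoms) with atoms written as plain strings, and the function name is hypothetical.

```python
def elementary_unfold(c, b, d):
    """Unfold ground clause `c` at body atom `b` using clause `d`, which has `b` in its head.

    Resulting head: head(c) together with the head atoms of d other than b.
    Resulting body: body(c) with b replaced by body(d).
    """
    (c_head, c_body), (d_head, d_body) = c, d
    if b not in c_body or b not in d_head:
        raise ValueError("b must occur in the body of c and in the head of d")
    return (c_head | (d_head - {b}), (c_body - {b}) | d_body)

# c:  p(a) :- e(a,b), r(b)
# d:  e(a,b) v e(b,a) :- v2(a,b)
c = (frozenset({"p(a)"}), frozenset({"e(a,b)", "r(b)"}))
d = (frozenset({"e(a,b)", "e(b,a)"}), frozenset({"v2(a,b)"}))
unfolded = elementary_unfold(c, "e(a,b)", d)   # p(a) v e(b,a) :- v2(a,b), r(b)
```

The leftover head atom e(b,a) of the disjunctive clause moves into the head of the result, which is why unfolding a Horn clause against a disjunctive clause generally yields a disjunctive clause.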

Besides unfolding, some clause deletion rules are used, which preserve 
the semantics of a disjunctive logic program (thus, we discard useless clauses). 
More specifically, a) we can delete a clause which is a variant of another program 
clause, b) we can delete a clause which is a tautology (i.e. there is an atom which 
appears in both the head and the body of the clause), c) we can delete a failing 
clause (i.e. a clause which has an atom in its body which does not unify with 
any head atom of the program clauses), d) we can delete a clause which is 
subsumed by another program clause. We say that a clause C subsumes another 
clause D if there exists a substitution θ such that head(Cθ) ⊆ head(D) and 
body(Cθ) ⊆ body(D). All these deletion rules preserve the equivalence of the 
disjunctive programs [0]. 
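
For ground clauses the tautology and subsumption tests reduce to set operations, since the substitution θ can be taken to be the identity. The following sketch works under that simplifying assumption, with the same hypothetical encoding of a clause as a pair (head atoms, body atoms) of frozensets; the failing-clause rule is not covered.

```python
def is_tautology(clause):
    head, body = clause
    return bool(head & body)          # some atom occurs in both head and body

def subsumes(c, d):
    # Ground case: c subsumes d iff head(c) ⊆ head(d) and body(c) ⊆ body(d).
    return c[0] <= d[0] and c[1] <= d[1]

def prune(program):
    """Drop duplicate clauses, tautologies, and clauses subsumed by another clause."""
    clauses = list(dict.fromkeys(program))                 # duplicates, order kept
    clauses = [c for c in clauses if not is_tautology(c)]
    return [c for c in clauses
            if not any(d != c and subsumes(d, c) for d in clauses)]

def clause(head, body):
    return (frozenset(head), frozenset(body))

c_small = clause({"p(a)"}, {"v2(a,a)"})                    # p(a) :- v2(a,a)
c_disj = clause({"p(a)", "e(a,a)"}, {"v2(a,a)"})           # p(a) v e(a,a) :- v2(a,a)
c_long = clause({"p(a)"}, {"v2(a,a)", "r(a)"})             # p(a) :- v2(a,a), r(a)
c_taut = clause({"e(a,a)"}, {"e(a,a)"})                    # e(a,a) :- e(a,a)
pruned = prune([c_disj, c_small, c_long, c_small, c_taut])
```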

In the following, we will also use factoring. If two head atoms of a 
disjunctive clause C have a most general unifier θ then Cθ is a factor of C. When 
we say that we take the factoring closure of a program we mean that we add to 
the program all factors obtained by applying factoring to all program clauses in 
all possible ways. 

All the transformation rules presented so far preserve the equivalence of pro- 
grams. Now we present, through the following two lemmas, two deletion rules, 
which although they do not preserve the equivalence of programs, they preserve 
the set of the atoms of the query predicate which are logical consequences of the 
programs. 

Lemma 4. Let V be a disjunctive logic program, p be a predicate in V and C 
be a clause in V of the form 

A_1 ∨ ... ∨ A_m :- B_1, ..., B_n 

Suppose that there is an atom A_j in {A_1, ..., A_m} whose predicate is different 
from p, such that there is no clause in V with a body atom which unifies with 
A_j. Let V' = V − {C}. Then, for any database 𝒟, V'_p(𝒟) = V_p(𝒟). 
Lemma 5. Let V be a disjunctive logic program, p be a predicate in V and C 
be a clause in V of the form 

A_1 ∨ ... ∨ A_m :- B_1, ..., B_n 

Suppose that there is no clause in V with a body atom whose predicate is p. 
Suppose also that V is closed under factoring and that there are two or more 
atoms in the head of C whose predicate is p. Let V' = V − {C}. Then, for any 
database 𝒟, V'_p(𝒟) = V_p(𝒟). 

In the following, we present an algorithm which gives a systematic way to 
apply the above described operations on V⁻¹ ∪ V deriving, in certain cases, a 
retrievable datalog(≠) program maximally contained in V. We suppose, without 
loss of generality, that all predicates appearing in the body of rules of the query 
program are EDB predicates (as the program is non-recursive, we can always 
obtain a program of this form by applying unfolding to all IDB body atoms). 

Procedure: 

Input: V⁻¹, V. 

Output: Vder. 

begin 

  - Let P = V ∪ V⁻¹. 

  - Let Sp be the set of EDB predicates in V. 

  while Sp ≠ {} and check2 is true do 

  begin 

    - Select a predicate e from Sp. 

    while check1(e) is true and check2 is true do 

    begin 

      - Select a clause D from P whose body contains an atom B 
        whose predicate is e. 

      - Unfold D at B by P. Let P' = (P − {D}) ∪ unfold(D, B). 

      - Let P'' be the program obtained from P' by getting 
        the factoring closure of the clauses in unfold(D, B). 

      - Let P''' be the program obtained from P'' by 
        applying the deletion rules. 

      - Let P = P'''. 

    end 

    Let Sp = Sp − {e}. 

  end 

end. 

The condition check1(e) is true if: There is a clause in P with a body atom 
whose predicate is e. 

The condition check2 is true if: There is no clause in P with an EDB atom E_1 
in its body which unifies with an EDB atom E_2 in the head of the same clause. 

Example 3. Consider the following datalog program V 

(1) p(X) :- e(X, Y), r(Y) 
(2) p(X) :- e(Y, X), s(X) 
(3) p(X) :- e(X, X) 

and assume that there are two materialized views v1 and v2 available 

(4) v1(X, Y) :- r(X), s(Y) 

(5) v2(X, Y) :- e(X, Y) 

(6) v2(X, Y) :- e(Y, X) 

Applying the procedure described in the previous section, we get Vdisj which 
contains the rules (1)-(3) and the following V⁻¹-rules: 

(7) r(X) :- v1(X, Y) 

(8) s(Y) :- v1(X, Y) 

(9) e(X, Y) ∨ e(Y, X) :- v2(X, Y) 

Now we apply the procedure described above in order to obtain a datalog 
program which has the same atomic results as Vdisj. 

We unfold (1) at ‘e(X,Y)’ using (9). We get a new equivalent program, which
contains the clauses P1 = {2, 3, 7, 8, 9, 10, 11, 12}, where:

(10) p(X) ∨ e(Y,X) :- v2(X,Y), r(Y)

(11) p(X) ∨ e(Y,X) :- v2(Y,X), r(Y)

(12) p(X) ∨ p(Y) :- v2(Y,X), r(X), r(Y)

Unfolding (2) at ‘e(X,Y)’ using (9), (10), (11) we get P2 = {3, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17}, where:

(13) p(X) ∨ e(X,Y) :- v2(Y,X), s(X)

(14) p(X) ∨ e(X,Y) :- v2(X,Y), s(X)

(15) p(X) ∨ p(Y) :- v2(Y,X), s(X), s(Y)

(16) p(X) :- v2(X,Y), r(Y), s(X)

(17) p(X) :- v2(Y,X), r(Y), s(X)

Finally, unfolding (3) at ‘e(X,X)’ using (9), (10), (11), (13), (14) we get
P3 = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24}, where:

(18) p(X) ∨ e(X,X) :- v2(X,X)

(19) p(X) ∨ e(X,X) :- v2(X,X)

(20) p(X) :- v2(X,X)

(21) p(X) :- v2(X,X), r(X)

(22) p(X) :- v2(X,X), r(X)

(23) p(X) :- v2(X,X), s(X)

(24) p(X) :- v2(X,X), s(X)

Taking the factoring closure of the program, we get from clauses (12) and
(15) the clauses:

(25) p(X) :- v2(X,X), r(X), r(X)

(26) p(X) :- v2(X,X), s(X), s(X)

which are also added to the program. Clauses (19), (22), (24), (25) can be
deleted since they are variants of other program clauses. Clauses (18), (21), (23)
and (26) can be deleted since they are subsumed by clause (20). Clauses (9),
(10), (11), (13) and (14) can be deleted because of lemma 2 (i.e., the atom with
the predicate ‘e’ does not occur in any program clause). Finally, clauses (12)
and (15) can be deleted from the program by an earlier lemma. Therefore, the
program obtained so far is {7, 8, 16, 17, 20}. Now, unfolding the EDB atoms in
the bodies of (16) and (17) and deleting the redundant clauses, we finally get:

p(X) :- v2(X,X)

p(Y) :- v2(X,Y), v1(Z1,Y), v1(X,Z2)

p(X) :- v2(X,Y), v1(Z1,X), v1(Y,Z2)

This program is a retrievable datalog program maximally contained in 𝒫. □
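
As an illustration of Example 3, the following sketch evaluates the query program (rules (1)-(3)) over base relations and the derived three-rule program over the view extents, and compares the answers. The function names and the sample relation contents are made up for this illustration; the check covers one database instance only and is not a proof of containment.

```python
from itertools import product


def query_answers(e, r, s):
    """Answers for p under rules (1)-(3), evaluated on the base relations."""
    answers = {x for (x, y) in e if y in r}        # (1) p(X) :- e(X,Y), r(Y)
    answers |= {x for (y, x) in e if x in s}       # (2) p(X) :- e(Y,X), s(X)
    answers |= {x for (x, y) in e if x == y}       # (3) p(X) :- e(X,X)
    return answers


def materialize_views(e, r, s):
    """View extents according to the view definitions (4)-(6)."""
    v1 = set(product(r, s))                        # (4) v1(X,Y) :- r(X), s(Y)
    v2 = set(e) | {(y, x) for (x, y) in e}         # (5), (6)
    return v1, v2


def rewritten_answers(v1, v2):
    """Answers for p under the derived program, which reads only v1 and v2."""
    v1_first = {a for (a, _b) in v1}               # values X with v1(X, Z2)
    v1_second = {b for (_a, b) in v1}              # values Y with v1(Z1, Y)
    answers = {x for (x, y) in v2 if x == y}       # p(X) :- v2(X,X)
    # p(Y) :- v2(X,Y), v1(Z1,Y), v1(X,Z2)
    answers |= {y for (x, y) in v2 if y in v1_second and x in v1_first}
    # p(X) :- v2(X,Y), v1(Z1,X), v1(Y,Z2)
    answers |= {x for (x, y) in v2 if x in v1_second and y in v1_first}
    return answers


# Hypothetical sample database.
e = {(1, 2), (3, 3), (4, 5)}
r = {2, 4}
s = {1, 5}
v1, v2 = materialize_views(e, r, s)
print(sorted(rewritten_answers(v1, v2)), sorted(query_answers(e, r, s)))
```

On this instance every answer produced from the views is also an answer of the query program, as containment requires.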

Concerning the procedure described above we can prove the following theorem.

Theorem 3. The following hold for the procedure above:

1. The procedure always terminates.

2. Suppose that by applying the procedure above to a program P_disj we get a
program P_der whose clause bodies do not contain any EDB atom. Then P_der
is a non-recursive datalog(≠) program which is maximally contained in 𝒫.

In the following we give a syntactic sufficient condition for the algorithm to
end up with a datalog(≠) program.

Sufficient condition: Consider the 𝒱⁻¹-rules and the query program 𝒫. For
each rule in 𝒫 there is at most one atom in its body whose predicate occurs in
the head of a (non-trivial) disjunctive 𝒱⁻¹-rule.

Theorem 4. If the sufficient condition holds for a set of 𝒱⁻¹-rules and a
query program 𝒫, then by applying the procedure to 𝒱⁻¹ ∪ 𝒫 we get a retrievable
maximally contained non-recursive datalog(≠) program.
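
The sufficient condition is purely syntactic, so it can be checked mechanically. The sketch below is one possible reading of it, under two assumptions that are not in the text: rules are encoded as Python tuples (an atom is a `(predicate, arguments)` pair), and a "non-trivial disjunctive" inverse rule is taken to mean one with more than one head atom. All names are hypothetical.

```python
def disjunctive_head_predicates(inverse_rules):
    """Predicates occurring in the head of an inverse rule with several head atoms.

    Each inverse rule is a pair (head_atoms, body_atoms); "non-trivial
    disjunctive" is interpreted here as len(head_atoms) > 1 (an assumption).
    """
    return {pred
            for head, _body in inverse_rules if len(head) > 1
            for pred, _args in head}


def satisfies_sufficient_condition(query_rules, inverse_rules):
    """True if every query rule has at most one body atom over such a predicate."""
    critical = disjunctive_head_predicates(inverse_rules)
    return all(
        sum(1 for pred, _args in body if pred in critical) <= 1
        for _head, body in query_rules
    )


# Rules (1)-(3) and inverse rules (7)-(9) of Example 3.
query_rules = [
    (("p", ("X",)), [("e", ("X", "Y")), ("r", ("Y",))]),
    (("p", ("X",)), [("e", ("Y", "X")), ("s", ("X",))]),
    (("p", ("X",)), [("e", ("X", "X"))]),
]
inverse_rules = [
    ([("r", ("X",))], [("v1", ("X", "Y"))]),
    ([("s", ("Y",))], [("v1", ("X", "Y"))]),
    ([("e", ("X", "Y")), ("e", ("Y", "X"))], [("v2", ("X", "Y"))]),
]
print(satisfies_sufficient_condition(query_rules, inverse_rules))
```

Under this reading, Example 3 meets the condition because each query rule mentions the predicate e exactly once.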

5 Recursive Views 

In the case where the views are given by recursive datalog programs and the
query by a non-recursive datalog program 𝒫, the problem of computing a
retrievable program maximally contained in 𝒫 is reduced to that of non-recursive
view definitions (see the following lemma).

Lemma 6. Let 𝒫 be any non-recursive datalog program and v1, ..., vn be views
given by any datalog programs over the EDB predicates of 𝒫. Then, there exists
a positive integer N depending only on the sizes of the programs so that the
following holds:

Consider a new set of views v'1, ..., v'n with view definitions 𝒫_{v'1}, ..., 𝒫_{v'n},
respectively, where 𝒫_{v'i} is the disjunction of all conjunctive queries that are
produced by 𝒫_{vi} and have size less than N.

If 𝒫_v is a retrievable program contained in 𝒫 under the new view definitions,
then 𝒫_v is a retrievable program contained in 𝒫 under the original view
definitions.

Proof. Let c„ be a conjunctive query generated by Vy. Let be all expansions 
of Cy that result after replacing each occurrence of a view atom by one of their 
new definitions. Then, for each there is a conjunctive query ci generated 

by V and there exists a containment mapping, hi, from Ci to 

Now, obtain all expansions i = 1,2,..., of considering the origi- 

nal view definitions. Suppose there exists an i such that there is no conjunctive 
query in V with a containment mapping on oCy^^’’. Then, a view occurrence in 
Cy has been replaced by a long {> N = where s is the maximum size 

of the programs defining the query and the views) conjunctive query R{v)' R(v) 
appears as a subquery in oCy^^’’. ^From we construct a shorter expansion 

oc’f^pi : We can pump R{v) to get a shorter view expansion which, though, pre- 
serves all the neighborhoods of size < m =the size of the maximum conjunctive 
query in V (see lemmata ^ HI)- We apply this pumping on all expansions that are 
longer than N , getting, in the end, an expansion, Cy^^’’, of Cy under the new view 
definitions. There exists, though, a conjunctive query of V with a containment 
mapping on Since and have the same neighborhoods of size up 

to TO, there is a containment mapping also on this is a contradiction. □ 



Theorem 5. Let 𝒫 be any non-recursive connected datalog program and
v1, ..., vn be views given by any datalog programs over the EDB predicates of
𝒫. Suppose there exists a retrievable datalog program maximally contained in
𝒫 which has no persistencies. Then, there exists a retrievable datalog program
maximally contained in 𝒫 which is non-recursive.

Proof. An immediate consequence of the preceding lemma, and theorem 0 □ 

6 Chain Programs 

A chain rule is a rule over only binary predicates; moreover, the first variable in
the head is identical to the first variable of the first predicate in the body, the
second variable in the head is identical to the last variable of the last predicate
in the body, and the second variable of the i-th predicate is identical to the first
variable of the (i+1)-th predicate in the body. A chain program contains only
chain rules. It is easy to see that any conjunctive query generated by a chain
program 𝒫 corresponds to a canonical database which is a simple path spelling
a word over the alphabet of the EDB predicate symbols. Thus, we can view a
chain program as generating a language over this alphabet; we call this language
L_𝒫. Observe that for chain programs, L_𝒫 is a context-free language.
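
The chain-rule shape can be checked directly from the definition. The sketch below assumes a tuple encoding of atoms as `(predicate, (first, second))`, which is not part of the paper, and tests only the three conditions stated above (it does not, for instance, require the variables to be distinct).

```python
def is_chain_rule(head, body):
    """Check the chain-rule shape; atoms are (predicate, argument-tuple) pairs."""
    atoms = [head] + list(body)
    # The body must be non-empty and every predicate must be binary.
    if not body or any(len(args) != 2 for _pred, args in atoms):
        return False
    head_first, head_second = head[1]
    # The head starts where the first body atom starts and ends where the
    # last body atom ends.
    if body[0][1][0] != head_first or body[-1][1][1] != head_second:
        return False
    # Consecutive body atoms are linked: the second variable of atom i is
    # the first variable of atom i+1.
    return all(body[i][1][1] == body[i + 1][1][0] for i in range(len(body) - 1))


# The recursive rule of the application program given later in this section.
recursive_rule = (
    ("p", ("X", "Y")),
    [("a", ("X", "Z1")), ("p", ("Z1", "Z2")), ("p", ("Z2", "Z3")), ("a", ("Z3", "Y"))],
)
print(is_chain_rule(*recursive_rule))
```

Reading off the body predicates of a chain rule from left to right gives the right-hand side of the corresponding grammar production, which is why the generated language is context-free.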

In the case, though, where both the query and the views are defined by
recursive programs, it becomes undecidable to answer the question whether there
is a non-empty datalog program that is contained in 𝒫 and uses only the given
views as EDB predicates; it remains undecidable even in the case where both
program 𝒫 and the views are given by chain programs.

The reduction, in the following theorem is done from the containment prob- 
lem of context free grammars iniiTT] 

Theorem 6. Given a datalog chain program 𝒫 and views v1, v2, ..., vm which
are also given by chain programs over the EDB predicates of 𝒫, it is undecidable
whether there is a non-empty retrievable datalog program 𝒫_v contained in 𝒫.

In some cases, though, we can give a negative answer to the question whether
there exists a “simple” non-empty retrievable program, as stated in the
theorem that follows.

From here on, we consider the following situation: a datalog query given by a
chain program 𝒫 and materialized views v1, v2, ..., vm over the EDB predicates
of 𝒫, which are also given by chain programs. Let 𝒫_v be a datalog program (not
necessarily chain) contained in 𝒫 that uses only v1, v2, ..., vm as EDB predicates.

Consider a datalog program 𝒫 where both IDB and EDB predicates are
binary. Take any conjunctive query c produced by 𝒫 and take the canonical
database of c. We call 𝒫 simple if, for all c, the canonical database (viewed as a
directed graph) contains a number of pairwise node-disjoint simple paths with
endpoints on the two head constants, i.e., besides the head constants every
constant (viewed as a node) has in-degree one and out-degree one. Chain programs
are simple.

We need some technical concepts here on pumping lemmas for formal
languages. Let w be a word over the alphabet Σ. For a fixed positive integer N, we
say that w_i = u x^i v y^i z, i = 1, 2, ..., is a pumping sequence for w if w = uxvyz
and |xy| < N. For context-free languages, there exists a positive integer N
such that, for any word in a language L longer than N, there exists a pumping
sequence w_i, i = 1, 2, ..., all of whose words also belong to the language. We
call such a pumping sequence a proper pumping sequence with respect to the
language L.
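
A pumping sequence is easy to generate once a decomposition of the word is fixed. In the sketch below the function name and the letter `z` for the last factor are choices made for this illustration; the decomposition itself has to be supplied by the caller, and the code does not check that the resulting sequence is proper for any particular language.

```python
def pumping_sequence(u, x, v, y, z, count):
    """Return the first `count` words u x^i v y^i z, for i = 1, ..., count."""
    return [u + x * i + v + y * i + z for i in range(1, count + 1)]


# Words of the form a^n b b a^n, pumped on both sides of the middle "bb".
for word in pumping_sequence("a", "a", "bb", "a", "a", 3):
    print(word)
```

The first word of the sequence (i = 1) is the original word itself; each later word repeats x and y one more time.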

Theorem 7. Let 𝒫 be a chain program and v1, v2, ..., vm views over the EDB
predicates of 𝒫, which are also given by chain programs 𝒫_1, 𝒫_2, ..., 𝒫_m,
respectively. Suppose there exists a simple retrievable datalog program contained
in 𝒫. Then there exists a fixed positive integer N (depending only on the sizes of
the programs for the views and 𝒫), and there is a view definition 𝒫_i such that for
any word w in L_{vi}, with |w| > N, the following happens: there is a word w_𝒫
in L_𝒫 such that w_𝒫 = ww', and for any proper pumping sequence w_1, w_2, ... of
w with respect to L_{vi}, there exists an infinite subsequence w_{i0+1}, w_{i0+2}, ...
such that w_{i0+j}w' also belongs to L_𝒫, for all j = 1, 2, ....
For an application, consider the following program 𝒫:

p(X,Y) :- a(X,Z1), p(Z1,Z2), p(Z2,Z3), a(Z3,Y)
p(X,Y) :- b(X,Y)

and the views:

v1(X,Y) :- a(X,Z1), v1(Z1,Z2), a(Z2,Y)
v1(X,Y) :- b(X,Z1), b(Z1,Y)

v2(X,Y) :- a(X,Z1), v2(Z1,Y)

v2(X,Y) :- a(X,Y)

Theorem 8. Given the views v1, v2, there is no retrievable simple datalog
program contained in 𝒫.

Proof. Consider any word w = w'w'' of L_𝒫 where w' is a sufficiently large word
of L_{v1} (L_{v2} respectively). Consider a proper pumping sequence w_1, w_2, ... of
w' with respect to L_{v1} (L_{v2} respectively). Suppose there exists a retrievable
simple program. Then, for infinitely many i's, w_i w'' is also a word in L_𝒫,
according to the theorem above. Observe, though, that any word in L_𝒫 longer than
four contains 2k + 2 a's and k + 2 b's. The word w_i w'' will not retain this
balance; therefore it is not a word of L_𝒫. □
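
The counting argument in the proof can be checked on small cases. The sketch below reads the application program as the grammar p → a p p a | b (an encoding chosen for this illustration, with hypothetical function names), enumerates its words up to a bounded recursion depth, and tests the balance between a's and b's. Written as a single equation, 2k + 2 a's and k + 2 b's means #a = 2(#b − 1), which also holds for the one-letter word b.

```python
def words_of_program(depth):
    """Words derivable from p -> a p p a | b with nesting depth at most `depth`."""
    if depth == 0:
        return {"b"}
    shorter = words_of_program(depth - 1)
    return {"b"} | {"a" + left + right + "a" for left in shorter for right in shorter}


def is_balanced(word):
    """Necessary condition for membership in the language: #a = 2 * (#b - 1)."""
    return word.count("a") == 2 * (word.count("b") - 1)


# Every enumerated word satisfies the balance.
print(all(is_balanced(w) for w in words_of_program(3)))

# "abba" is both a word of the program and a word a^n b b a^n of view v1.
# Pumping it once more on both sides gives "aabbaa", which loses the balance.
print(is_balanced("abba"), is_balanced("aabbaa"))
```

The word used here is far shorter than the "sufficiently large" words the proof requires; it only illustrates why pumping a's without adding b's cannot stay inside the language.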

7 Conclusion 

We investigated the problem of answering queries using materialized views. In
particular, we searched for a retrievable program maximally contained in the
query. In the case where the query is defined by a non-recursive datalog program
and the views by recursive datalog programs, we reduced the problem to that of
non-recursive definitions for both the query and the views. We showed that, in the
case where both the query and the views are defined by recursive datalog
programs, the problem easily becomes undecidable; we showed, though, some
methods to produce negative results. It would be interesting to further pursue
this latter line of research. In the case where both the query and the views are
defined by non-recursive datalog programs, we showed how, in certain cases, we
can produce a retrievable non-recursive datalog(≠) program maximally contained
in the query. It seems that further investigation in this direction will produce
interesting results. We are currently working on this.

Acknowledgment: We thank Vassilis Vassalos for helpful discussions. 
Abstract. A data warehouse stores materialized views derived from
one or more sources for the purpose of efficiently implementing decision- 
support or OLAP queries. One of the most important decisions in de- 
signing a data warehouse is the selection of materialized views to be 
maintained at the warehouse. The goal is to select an appropriate set 
of views that minimizes total query response time and/or the cost of 
maintaining the selected views, given a limited amount of resource such 
as materialization time, storage space, or total view maintenance time. 
In this article, we develop algorithms to select a set of views to materialize 
in a data warehouse in order to minimize the total query response time 
under the constraint of a given total view maintenance time. As the 
above maintenance-cost view-selection problem is extremely intractable, 
we tackle some special cases and design approximation algorithms. First, 
we design an approximation greedy algorithm for the maintenance-cost 
view-selection problem in OR view graphs, which arise in many practical 
applications, e.g., data cubes. We prove that the query benefit of the 
solution delivered by the proposed greedy heuristic is within 63% of 
that of the optimal solution. Second, we also design an A* heuristic, 
that delivers an optimal solution, for the general case of AND-OR view 
graphs. We implemented our algorithms and a performance study of the 
algorithms shows that the proposed greedy algorithm for OR view graphs 
almost always delivers an optimal solution. 



1 Introduction 

A data warehouse is a repository of integrated information available for querying
and analysis [IK93, HGMW+95, Wid95]. Figure 1 illustrates the architecture of
a typical warehouse [WGL+96]. The bottom of the figure depicts the multiple
information sources of interest. Data that is of interest to the client(s) is derived
or copied and integrated into the data warehouse, depicted near the top of the
figure. These views stored at the warehouse are often referred to as materialized
views. The integrator, which lies in between the sources and the warehouse, is
responsible for maintaining the materialized views at the warehouse in response
to changes at the sources [ZGMHW95, CW91, GM95]. This incremental
maintenance of views incurs what is known as maintenance cost. We use the term
maintenance cost interchangeably with maintenance time.

* Supported by NSF grant IRI-96-31952. 
One of the advantages of such a system is that user queries can be answered
using the information stored at the warehouse and need not be translated and
shipped to the original source(s) for execution. Also, warehouse data is available
for queries even when the original information source(s) are inaccessible due to
real-time operations or other reasons. Widom in [Wid95] gives a nice overview
of the technical issues that arise in a data warehouse.




Fig. 1. A typical data warehouse architecture 



The selection of which views to materialize is one of the most important
decisions in the design of a data warehouse. Earlier work [Gup97] presents a
theoretical formulation of the general “view-selection” problem in a data
warehouse. Given some resource constraint and a query load to be supported at
the warehouse, the view-selection problem defined in [Gup97] is to select a set
of derived views to materialize, under a given resource constraint, that will
minimize the sum of total query response time and maintenance time of the
selected views. [Gup97] presents near-optimal polynomial-time greedy algorithms
for some special cases of the general problem where the resource constraint is
disk space.

In this article, we consider the view-selection problem of selecting views to 
materialize in order to optimize the total query response time, under the con- 
straint that the selected set of views incur less than a given amount of total main- 
tenance time. Hereafter, we will refer to this problem as the maintenance-cost 
view-selection problem. The maintenance-cost view-selection problem is much 
more difficult than the view-selection problem with a disk-space constraint, be- 
cause the maintenance cost of a view v depends on the set of other materialized 
views. For the special case of “OR view graphs,” we present a competitive greedy 
algorithm that provably delivers a near-optimal solution. The OR view graphs, 
which are view graphs where exactly one view is used to derive another view, 
arise in many important practical applications. A very important application 
is that of OLAP warehouses called data cubes, where the candidate views for 
precomputation (materialization) form an “OR boolean lattice.” For the general
maintenance-cost view-selection problem that arises in a data warehouse, i.e., 
for the general case of AND-OR view graphs, we present an A* heuristic that 
delivers an optimal solution. 

The rest of the paper is organized as follows. The rest of this section gives a 
brief summary of the related work. In Section 2, we present the motivation for the 
maintenance-cost view-selection problem and the main contributions of this arti- 
cle. Section 3 presents some preliminary definitions. We define the maintenance- 
cost view-selection problem formally in Section 4. In Section 5, we present an 
approximation greedy algorithm for the maintenance-cost view-selection prob- 
lem in OR view graphs. Section 6 presents an A* heuristic that delivers an 
optimal set of views for the maintenance-cost view-selection problem in general 
AND-OR view graphs. We present our experimental results in Section 7. Finally,
we give some concluding remarks in Section 8.

1.1 Related Work 

Recently, there has been a lot of interest in the problem of selecting views to
materialize in a data warehouse. Harinarayan, Rajaraman and Ullman [HRU96]
provide algorithms to select views to materialize in order to minimize the total
query response time, for the case of data cubes or other OLAP applications when
there are only queries with aggregates over the base relation. The view graphs
that arise in OLAP applications are special cases of OR view graphs. The
authors in [HRU96] propose a polynomial-time greedy algorithm that delivers a
near-optimal solution. Gupta et al. in [GHRU97] extend their results to
selection of views and indexes in data cubes. Gupta in [Gup97] presents a theoretical
formulation of the general view-selection problem in a data warehouse and
generalizes the previous results to (i) AND view graphs, where each view has a unique
execution plan, (ii) OR view graphs, (iii) OR view graphs with indexes, (iv)
AND view graphs with indexes, and some other special cases of AND-OR view
graphs. All of the above mentioned works ([HRU96, GHRU97, Gup97]) present
approximation algorithms to select a set of structures that minimizes the total
query response time under a given space constraint; the constraint represents
the maximum amount of disk space that can be used to store the materialized
views.

Other recent works on the view-selection problem have been as follows. Ross,
Srivastava, and Sudarshan in [RSS96], Yang, Karlapalem, and Li in [YKL97],
Baralis, Paraboschi, and Teniente in [BPT97], and Theodoratos and Sellis in
[TS97] provide various frameworks and heuristics for selection of views in order
to optimize the sum of query response time and view maintenance time without any
resource constraint. Most of the heuristics presented there are either exhaustive
searches or do not have any performance guarantees on the quality of the solution
delivered.

Ours is the first article to address the problem of selecting views to materi- 
alize in a data warehouse under the constraint of a given amount of total view 
maintenance time. 
2 Motivation and Contributions

Most of the previous work ([HRU96, GHRU97, Gup97]) on designing
polynomial-time approximation algorithms that provably deliver a near-optimal
solution for the view-selection problem suffers from one drawback. The designed
algorithms apply only to the case of a disk-space constraint.

Though the previous work has offered significant insight into the nature of the 
view-selection problem, the constraint considered therein makes the results less 
applicable in practice because disk-space is very cheap in real life. In practice, 
the real constraining factor that prevents us from materializing everything at 
the warehouse is the maintenance time incurred in keeping the materialized 
views up to date at the warehouse. Usually, changes to the source data are 
queued and propagated periodically to the warehouse views in a large batch 
update transaction. The update transaction is usually done overnight, so that 
the warehouse is available for querying and analysis during the daytime. Hence,
there is a constraint on the time that can be allotted to the maintenance of
materialized views.

In this article, we consider the maintenance-cost view-selection problem, which
is to select a set of views to materialize in order to minimize the query response
time under a constraint of maintenance time. We do not make any assumptions
about the query or the maintenance cost models. It is easy to see that the
view-selection problem under a disk-space constraint is only a special case of the
maintenance-cost view-selection problem, when the maintenance cost of each view
remains a constant, i.e., the cost of maintaining a view is independent of the set
of other materialized views. Thus, the maintenance-cost view-selection problem
is trivially NP-hard, since the space-constraint view-selection problem is
NP-hard [Gup97].

Now, we explain the main differences between the view-selection problem
under the space constraint and the maintenance-cost view-selection problem, which
make the maintenance-cost view-selection optimization problem more difficult.
In the case of the view-selection problem with space constraint, as the query
benefit of a view never increases with materialization of other views, the
query-benefit per unit space of a non-selected view always decreases monotonically
with the selection of other views. This property is formally defined in [Gup97]
as the monotonicity property of the benefit function and is repeated here for
convenience.



Definition 1. (Monotonic Property) A benefit function B, which is used to
prioritize views for selection, is said to satisfy the monotonicity property for a
set of views M with respect to distinct views V1 and V2 if B({V1, V2}, M) is less
than (or equal to) either B({V1}, M) or B({V2}, M).
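
The definition translates into a one-line test on a benefit function. The sketch below uses two toy benefit functions with made-up numbers (none of them come from the paper, and both ignore the set M for simplicity): one divides query benefit by disk space, the other divides it by a maintenance cost that two dependent views share.

```python
def satisfies_monotonicity(B, v1, v2, M):
    """Definition 1: B({v1, v2}, M) is at most B({v1}, M) or at most B({v2}, M)."""
    joint = B(frozenset({v1, v2}), M)
    return joint <= B(frozenset({v1}), M) or joint <= B(frozenset({v2}), M)


# Hypothetical numbers, chosen only to illustrate the two behaviours.
QUERY_BENEFIT = {"V1": 10, "V2": 6}
SPACE = {"V1": 5, "V2": 2}
DEPENDENT_BENEFIT = {"V1": 8, "V2": 8}


def benefit_per_unit_space(views, M):
    # Space is additive, so the ratio of the pair lies between the two ratios.
    return sum(QUERY_BENEFIT[v] for v in views) / sum(SPACE[v] for v in views)


def benefit_per_unit_maintenance(views, M):
    # Assumed cost model: maintaining either view alone costs 4, and
    # maintaining both together also costs 4 because they share the work.
    shared_cost = 4
    return sum(DEPENDENT_BENEFIT[v] for v in views) / shared_cost


print(satisfies_monotonicity(benefit_per_unit_space, "V1", "V2", frozenset()))
print(satisfies_monotonicity(benefit_per_unit_maintenance, "V1", "V2", frozenset()))
```

With additive space the pair can never beat both of its members, whereas with a shared maintenance cost the pair has twice the benefit per unit cost of either view alone.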

In the case of the view-selection problem under space constraint, the
query-benefit per unit space function satisfies the above defined monotonicity property
for all sets M and views V1 and V2. However, for the case of the
maintenance-cost view-selection problem, the maintenance cost of a view can decrease with
selection of other views for materialization and hence, the query-benefit per unit
of maintenance-cost of a view can actually increase. Hence, the total maintenance
cost of two “dependent” views may be much less than the sum of the maintenance
costs of the individual views, causing the query-benefit per unit maintenance-cost
of two dependent views to be sometimes much greater than the query-benefit
per unit maintenance-cost of either of the individual views. The above described
non-monotonic behavior of the query-benefit function makes the
maintenance-cost view-selection problem intractable. The non-monotonic behavior of the
query-benefit per unit maintenance-cost function is illustrated by an example later
in the paper, where it is shown that the simple greedy approaches presented in
previous works for the space-constraint view-selection problem could deliver an
arbitrarily bad solution when applied to the maintenance-cost view-selection
problem.

Contributions In this article, we have identified the maintenance-cost
view-selection problem and the difficulty it presents. We develop a couple of
algorithms to solve the maintenance-cost view-selection problem within the
framework of general query and maintenance cost models. For the maintenance-cost
view-selection problem in general OR view graphs, we present a greedy
heuristic that selects a set of views at each stage of the algorithm. We prove that
the proposed greedy algorithm delivers a near-optimal solution. The OR view
graphs, where exactly one view is used to compute another view, arise in many
important practical applications. A very important application is that of OLAP
warehouses called data cubes, where the candidate views for precomputation
(materialization) form an “OR boolean lattice.” We also present an A*
heuristic for the general case of AND-OR graphs. Performance studies indicate that
the proposed greedy heuristic almost always returns an optimal solution for OR
view graphs. The maintenance-cost view-selection problem was one of the open
problems mentioned in [Gup97]. By designing an approximation algorithm for the
problem, this article essentially answers one of the open questions raised in [Gup97].

3 Preliminaries 

In this section, we present a refinement of a few definitions from [Gup97] used
in this article. Throughout this article, we use V(G) and E(G) to denote the set
of vertices and edges respectively of a graph G.

Definition 2. (Expression AND-DAG) An expression AND-DAG for a view,
or a query, V is a directed acyclic graph having the base relations (and
materialized views) as “sinks” (no outgoing edges) and the node V as a “source” (no
incoming edges). If a node/view u has outgoing edges to nodes v1, v2, ..., vk,
then u can be computed from all of the views v1, v2, ..., vk and this
dependence is indicated by drawing a semicircle, called an AND arc, through the edges
(u, v1), (u, v2), ..., (u, vk). Such an AND arc has an operator¹ and a query-cost
associated with it, which is the cost incurred during the computation of u from
v1, v2, ..., vk.

¹ The operator associated with the AND arc is actually a k-ary function involving
operations like join, union, aggregation, etc.





Fig. 2. a) An expression AND-DAG b) An expression ANDOR-DAG 



Definition 3. (Expression ANDOR-DAG) An expression ANDOR-DAG
for a view or a query V is a directed acyclic graph with V as a source and
the base relations as sinks. Each non-sink node v has associated with it one or
more AND arcs, where each AND arc binds a subset of the outgoing edges of
node v. As in the previous definition, each AND arc has an operator and a cost
associated with it. More than one AND arc at a node depicts multiple ways of
computing that node.

Figure 2 shows an example of an expression AND-DAG as well as an
expression ANDOR-DAG. In Figure 2(b), the node a can be computed either from
the set of views {b, c, d} or {d, e, f}. The view a can also be computed from the
set {j, k, f}, as d can be computed from j or k and e can be computed from k.

Definition 4. (AND-OR View Graph) A directed acyclic graph G having
the base relations as the sinks is called an AND-OR view graph for the views
(or queries) V1, V2, ..., Vk if for each Vi, there is a subgraph² Gi in G that is an
expression ANDOR-DAG for Vi. Each node v in an AND-OR view graph has
the following parameters associated with it: query-frequency f_v (frequency of the
queries on v), update-frequency g_v (frequency of updates on v), and
reading-cost R_v (cost incurred in reading the materialized view v). Also, there is a
maintenance-cost function UC³ associated with G, such that for a view v and a
set of views M, UC(v, M) gives the cost of maintaining v in presence of the set
M of materialized views.

² An AND-OR view graph H is called a subgraph of an AND-OR view graph G if
V(H) ⊆ V(G), E(H) ⊆ E(G), and each edge e1 in H is bound with the same set
of edges through an AND-arc as it is bound through an AND-arc in G. That is, if
e1, e2 ∈ E(G), e1 ∈ E(H), and e1 and e2 are bound by an AND-arc (which may
bind other edges too) in G, then e2 ∈ E(H), and e1 and e2 are bound with the same
AND-arc in H. For example, Figure 2(a) is a subgraph of Figure 2(b), but Figure 2(a)
without the edge (c, h) is not.

Given a set of queries Q1, Q2, ..., Qk to be supported at a warehouse, [Gup97]
shows how to construct an AND-OR view graph for the queries.

Definition 5. (Evaluation Cost) The evaluation cost of an AND-DAG H
embedded in an AND-OR view graph G is the sum of the costs associated with the
AND arcs in H, plus the sum of the reading costs associated with the sinks/leaves
of H.

Definition 6. (OR View Graphs) An OR view graph is a special case of an
AND-OR view graph, where each AND-arc binds exactly one edge. Hence, in OR
view graphs, we omit drawing AND arcs and label the edges, rather than the AND
arcs, with query-costs. Also, in OR view graphs, instead of the maintenance cost
function UC for the graph, there is a maintenance-cost value associated with
each edge (u, v), which is the maintenance cost incurred in maintaining u using
the materialized view v. Figure 3 shows an example of an OR view graph 𝒢.

4 The Maintenance-Cost View-Selection Problem 

In this section, we present a formal definition of the maintenance-cost view- 
selection problem which is to select a set of views in order to minimize the total 
query response time under a given maintenance-cost constraint. 

Given an AND-OR view graph G and a quantity S (available maintenance
time), the maintenance-cost view-selection problem is to select a set of views M,
a subset of the nodes in G, that minimizes the total query response time such
that the total maintenance time of the set M is less than S.

More formally, let Q(v, M) denote the cost of answering a query v (also a
node of G) in the presence of a set M of materialized views. As defined before,
UC(v, M) is the cost of maintaining a materialized view v in presence of a set
M of materialized views. Given an AND-OR view graph G and a quantity S,
the maintenance-cost view-selection problem is to select a set of views/nodes M
that minimizes τ(G, M), where

\[ \tau(G, M) = \sum_{v \in V(G)} f_v \, Q(v, M), \]

under the constraint that U(M) < S, where U(M), the total maintenance time,
is defined as

\[ U(M) = \sum_{v \in M} g_v \, UC(v, M). \]

The view-selection problem under a disk-space constraint can be easily shown
to be NP-hard, as there is a straightforward reduction [Gup97] from the minimum
set cover problem. Thus, the maintenance-cost view-selection problem, which
is a more general problem as discussed in Section 2, is trivially NP-hard.

³ The function symbol “UC” denotes update cost.
Computing Q(v, M) The cost of answering a query v in presence of M, Q(v, M),
in an AND-OR view graph G is actually the evaluation cost of the cheapest
AND-DAG H_v for v, such that H_v is a subgraph of G and the sinks of H_v belong to
the set M ∪ L, where L is the set of sinks in G. Here, without loss of generality,
we have assumed that the nodes in L, the set of sinks in G, are always available
for computation as they represent the base tables at the source(s). Thus, Q(v, ∅)
is the cost of answering a query on v directly from the source(s). For the special
case of OR view graphs, Q(v, M) is the minimum query-length of a path from v
to some u ∈ (M ∪ L), where the query-length of a path from v to u is defined as
R_u, the reading cost of u, plus the sum of the query-costs associated with the
edges on the path.

Computing UC(v, M) in OR View Graphs The quantity UC(v, M), as defined
earlier, denotes the maintenance cost of a view v with respect to a selected
set of materialized views M, i.e., in presence of the set M ∪ L. As before, we
assume that the set L of base relations in G is always available. In general
AND-OR view graphs, the function UC, which depends upon the maintenance cost
model being used, is associated with the graph. However, for the special case of
OR view graphs, UC(v, M) is computed from the maintenance costs associated
with the edges in the graph as follows. The quantity UC(v, M) is defined as the
minimum maintenance-length of a path from v to some u ∈ (M ∪ L) − {v}, where
the maintenance-length of a path is defined as the sum of the maintenance-costs
associated with the edges on the path.⁴ The above characterization of UC(v, M)
in OR view graphs is without any loss of generality of a maintenance-cost model,
because in OR view graphs a view u uses at most one view to help maintain
itself.

In the following example, we illustrate the above definitions of Q and UC on 
OR view graphs. 




Fig. 3. 𝒢: An OR view graph. (Figure not reproduced; its legend reads: 
maintenance-cost of edges (V1, B) and (V2, B) = 4; all other maintenance costs 
and query costs are 0; all query and update frequencies = 1; labels associated 
with nodes are their reading-costs.) 



Example 1. Consider the OR view graph 𝒢 of Figure 3. In the given OR view 
graph 𝒢, the maintenance-costs and query-costs associated with each edge are 
zero, except for the maintenance-cost of 4 associated with the edges (V1, B) 
and (V2, B). Also, all query and update frequencies are uniformly 1. The label 
associated with each of the nodes in 𝒢 is the reading-cost of the node. Also, 
the set of sinks L = {B}. 

In the OR view graph 𝒢, Q(Vi, ∅) = 12 for all i ≤ 5, because, as the 
query-costs are all zero, the minimum query-length of a path from Vi to B is 
just the reading-cost of B. Note that Q(B, ∅) = 12. Also, as the minimum 
maintenance-length of a path from a view Vi to B is 4, UC(Vi, ∅) = 4 for all 
i ≤ 5. 

Footnote: Note that the maintenance-length doesn’t include the reading cost of 
the destination as in the query-length of a path. 
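The two path-based definitions can be checked mechanically. The following is a minimal sketch, not the authors' implementation: it assumes a small OR view graph in the spirit of Example 1, in which V3, V4 and V5 are computed from V2, and the reading costs of V2 to V5 are assumed values because the figure's node labels are not recoverable here. The names `EDGES`, `READ`, `Q` and `UC` are illustrative.

```python
import heapq

# Assumed graph: (view, computed_from, query_cost, maintenance_cost).
EDGES = [
    ("V1", "B", 0, 4), ("V2", "B", 0, 4),
    ("V3", "V2", 0, 0), ("V4", "V2", 0, 0), ("V5", "V2", 0, 0),
]
# Reading costs; only B = 12 and V1 = 7 come from the text, the rest are assumed.
READ = {"B": 12, "V1": 7, "V2": 11, "V3": 8, "V4": 8, "V5": 8}
SINKS = {"B"}


def _min_path(v, targets, cost_index, add_read):
    """Dijkstra from v; return the cheapest cost of ending at any target."""
    best = float("inf")
    dist = {v: 0}
    heap = [(0, v)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u in targets:
            best = min(best, d + (READ[u] if add_read else 0))
        for s, t, qc, mc in EDGES:
            if s == u:
                nd = d + (qc, mc)[cost_index]
                if nd < dist.get(t, float("inf")):
                    dist[t] = nd
                    heapq.heappush(heap, (nd, t))
    return best


def Q(v, M):
    """Minimum query-length from v to a node of M ∪ L (adds the reading cost)."""
    return _min_path(v, set(M) | SINKS, 0, True)


def UC(v, M):
    """Minimum maintenance-length from v to a node of (M ∪ L) − {v}."""
    return _min_path(v, (set(M) | SINKS) - {v}, 1, False)
```

With nothing materialized, every node is answered from B at cost 12 and every view is maintained from B at cost 4, which matches the values stated in Example 1.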

Knapsack Effect. We simplify the view-selection problem as in prior discussions 
([HRU96, GHRU97, Gup97]) by allowing that a solution may consume “slightly” 
more than the given amount of constraint. This assumption is made to ignore 
the knapsack component of the view-selection problem. However, when proving 
the performance guarantee of a given algorithm, we compare the solution 
delivered by the algorithm with an optimal solution that consumes the same 
amount of resource as that consumed by the delivered solution. 

5 Inverted-Tree Greedy Algorithm 

In this section, we present a competitive greedy algorithm called the 
Inverted-Tree Greedy Algorithm, which delivers a near-optimal solution for the 
maintenance-cost view-selection problem in OR view graphs. 

In the context of the view-selection problem, a greedy algorithm was originally 
proposed in [HRU96] for selection of views in data cubes under a disk-space 
constraint. Gupta in [Gup97] generalized the results to some special cases of 
AND-OR view graphs, but still for the constraint of disk space. The greedy 
algorithms proposed in the context of view-selection work in stages. At each 
stage, the algorithm picks the “most beneficial” view. The algorithm continues 
to pick views until the set of selected views takes up the given resource 
constraint. One of the key notions required in designing a greedy algorithm for 
selection of views is the notion of the “most beneficial” view. 

In the greedy heuristics proposed earlier ([HRU96, GHRU97, Gup97]) for 
selection of views to materialize under a space constraint, views are selected 
in order of their “query benefits” per unit space consumed. We now define a 
similar notion of benefit for the maintenance-cost view-selection problem 
addressed in this article. 

Most Beneficial View 

Consider an OR view graph G. At a stage, when a set of views M has 
already been selected for materialization, the query benefit B(C, M) associated 
with a set of views C with respect to M is defined as τ(G, M) − τ(G, M ∪ C). 
We define the effective maintenance-cost EU(C, M) of C with respect to M as 
U(M ∪ C) − U(M). Based on these two notions, we define the view that has 
the most query-benefit per unit effective maintenance-cost with respect to M as 
the most beneficial view for greedy selection at the stage when the set M has 
already been selected for materialization. 

Footnote: The effective maintenance-cost may be negative. The results in this 
article hold nevertheless. 

We illustrate through an example that a simple greedy algorithm, that at 
each stage selects the most beneficial view, as defined above, could deliver an 
arbitrarily bad solution. 

Example 2. Consider the OR view graph 𝒢 shown in Figure 3. We assume 
that the base relation B is materialized and we consider the case when the 
maintenance-cost constraint is 4 units. 

We first compute the query benefit of V1 at the initial stage when only 
the base relation B is available (materialized). Recall from Example 1 that 
Q(Vi, ∅) = 12 for all i ≤ 5 and Q(B, ∅) = 12. Thus, τ(𝒢, ∅) = 12 × 6 = 72, 
as all the query frequencies are 1. Also, Q(V1, {V1}) = 7, as the reading-cost 
of V1 is 7, Q(Vi, {V1}) = 12 for i = 2, 3, 4, 5, and Q(B, {V1}) = 12. Thus, 
τ(𝒢, {V1}) = 12 × 5 + 7 = 67 and thus, the initial query benefit of V1 is 
72 − 67 = 5. Similarly, the initial query benefits of each of the views V2, V3, 
V4, and V5 can be computed to be 4. 

Also, U({Vi}) = UC(Vi, {Vi}) = 4 as the minimum maintenance-length of a 
path from any Vi to B is 4. Thus, the solution returned by the simple greedy 
algorithm, which picks the most beneficial view, as defined above, at each 
stage, is {V1}. 

It is easy to see that the optimal solution is {V2, V3, V4, V5} with a query 
benefit of 11 and a total maintenance time of 4. To demonstrate the 
non-monotonic behavior of the benefit function, observe that the query-benefits 
per unit maintenance-cost of sets {V2}, {V3}, {V2, V3} are 1, 1, and 7/4 
respectively. This non-monotonic behavior is the reason why the simple greedy 
algorithm that selects views on the basis of their query-benefits per unit 
maintenance-cost can deliver an arbitrarily bad solution. Figure 4 illustrates 
through an extended example that the optimal solution can be made to have an 
arbitrarily high query benefit, while keeping the simple greedy solution 
unchanged. 
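The behavior described in Example 2 can be replayed with a short sketch. It is illustrative only and rests on assumptions: V3, V4 and V5 are taken to be computed from V2, the reading costs of V2 to V5 (11, 8, 8, 8) are assumed values chosen to reproduce the individual benefits of 4 and the ratio 7/4 for {V2, V3} (the figure's labels are not recoverable, so other totals need not match the text), all helper names are hypothetical, and the budget is enforced strictly rather than with the knapsack relaxation.

```python
# Assumed OR view graph: (view, computed_from, query_cost, maintenance_cost).
EDGES = [("V1", "B", 0, 4), ("V2", "B", 0, 4),
         ("V3", "V2", 0, 0), ("V4", "V2", 0, 0), ("V5", "V2", 0, 0)]
READ = {"B": 12, "V1": 7, "V2": 11, "V3": 8, "V4": 8, "V5": 8}  # V2..V5 assumed
VIEWS = {"V1", "V2", "V3", "V4", "V5"}
INF = float("inf")


def path_min(v, targets, idx, read):
    """Cheapest path from v to a target in a DAG; idx 0 = query, 1 = maintenance."""
    here = read.get(v, 0) if v in targets else INF
    onward = [e[2 + idx] + path_min(e[1], targets, idx, read)
              for e in EDGES if e[0] == v]
    return min([here] + onward)


def tau(M):
    """Total query cost, with every query frequency equal to 1."""
    return sum(path_min(v, M | {"B"}, 0, READ) for v in READ)


def U(M):
    """Total maintenance time, with every update frequency equal to 1."""
    return sum(path_min(v, (M | {"B"}) - {v}, 1, {}) for v in M)


def benefit(C, M):
    return tau(M) - tau(M | C)


def EU(C, M):
    return U(M | C) - U(M)


def simple_greedy(S):
    """At each stage add the single view with the best benefit per unit EU."""
    M = set()
    while True:
        best, pick = 0.0, None
        for v in sorted(VIEWS - M):
            eu, b = EU({v}, M), benefit({v}, M)
            if U(M | {v}) > S or b <= 0:
                continue  # over budget, or no gain
            ratio = b / eu if eu > 0 else INF
            if ratio > best:
                best, pick = ratio, v
        if pick is None:
            return M
        M.add(pick)
```

Under these assumptions `simple_greedy(4)` stops after choosing V1, while the set {V2, V3, V4, V5} fits in the same budget with a larger query benefit.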




Fig. 4. An OR view graph, ℋ, for which simple greedy performs arbitrarily bad. 
(Figure not reproduced; its legend reads: maintenance-cost of edges (V1, B) and 
(V2, B) = 4; all other edge maintenance costs and query costs are 0; all query 
and update frequencies = 1; reading-cost of B = 12; reading-cost of Vi = 3 if i 
is odd, else 12; maintenance-time constraint = 4; optimal solution is the set 
of all views except V1; simple greedy solution = {V1}.) 
Note that the nodes in the OR view graphs 𝒢 and ℋ, presented in Figure 3 
and Figure 4 respectively, can be easily mapped into real queries involving 
aggregations over the base data B. The query-costs associated with the edges 
in 𝒢 and ℋ depict the linear cost model, where the cost of answering a query 
on v using its descendant u is directly proportional to the size of the view u, 
which in our model of OR view graphs is represented by the reading-cost of u. 
Notice that the minimum query-length of a path from u to v in 𝒢 or ℋ is 
the reading-cost of v. As zero maintenance-costs in the OR view graphs 𝒢 and 
ℋ can be replaced by extremely small quantities, the OR view graphs 𝒢 and 
ℋ depict the plausible scenario when the cost of maintaining a view u from a 
materialized view v is negligible in comparison to the maintenance cost incurred 
in maintaining a view u directly from the base data B. 

Definition 7. (Inverted Tree Set) A set of nodes R is defined to be an 
inverted tree set in a directed graph G if there is a subgraph (not necessarily 
induced) T_R in the transitive closure of G such that the set of vertices of 
T_R is R, and the inverse graph of T_R is a tree. 

In the OR view graph 𝒢 of Figure 3, any subset of {V2, V3, V4, V5} that 
includes V2 forms an inverted tree set. The T_R graph corresponding to the 
inverted tree set R = {V2, V3, V5} has the edges (V2, V5) and (V2, V3) only. 

The motivation for the inverted tree set comes from the following observation, 
which we prove in Lemma 1. In an OR view graph, an arbitrary set O (in 
particular an optimal solution O), can be partitioned into inverted tree sets such 
that the effective maintenance-cost of O with respect to an already materialized 
set M is greater than the sum of effective-costs of inverted tree sets with respect 
to M. 

Based on the notion of an inverted tree set, we develop a greedy heuristic 
called Inverted-tree Greedy Algorithm which, at each stage, considers all inverted 
tree sets in the given view graph and selects the inverted tree set that has the 
most query-benefit per unit effective maintenance-cost. 

Algorithm 1 Inverted-Tree Greedy Algorithm 

Given: An OR view graph (G), and a total view maintenance time constraint S. 
BEGIN 
    M = ∅; B_C = 0; 
    repeat 
        for each inverted tree set of views T in G such that T ∩ M = ∅ 
            if (EU(T, M) ≤ S) and (B(T, M)/EU(T, M) > B_C) 
                B_C = B(T, M)/EU(T, M); 
                C = T; 
            endif 
        endfor 
        M = M ∪ C; 
    until (U(M) ≥ S); 
    return M; 
END. 

Footnote: The inverse of a directed graph is the graph with its edges reversed. 

Footnote: A tree is a connected graph in which each vertex has exactly one 
incoming edge. 
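A brute-force rendering of the stage-wise selection may help make Algorithm 1 concrete. This is a sketch under stated assumptions rather than the authors' code: the example graph, its reading costs for V2 to V5, and all function names are hypothetical; inverted tree sets are enumerated as a root view plus any subset of its ancestors; the best ratio is reset at every stage; and an extra stop is added when no admissible set remains.

```python
from itertools import combinations

# Assumed OR view graph: (view, computed_from, query_cost, maintenance_cost).
EDGES = [("V1", "B", 0, 4), ("V2", "B", 0, 4),
         ("V3", "V2", 0, 0), ("V4", "V2", 0, 0), ("V5", "V2", 0, 0)]
READ = {"B": 12, "V1": 7, "V2": 11, "V3": 8, "V4": 8, "V5": 8}  # V2..V5 assumed
VIEWS = {"V1", "V2", "V3", "V4", "V5"}
INF = float("inf")


def path_min(v, targets, idx, read):
    """Cheapest path from v to a target in a DAG; idx 0 = query, 1 = maintenance."""
    here = read.get(v, 0) if v in targets else INF
    onward = [e[2 + idx] + path_min(e[1], targets, idx, read)
              for e in EDGES if e[0] == v]
    return min([here] + onward)


def tau(M):
    return sum(path_min(v, M | {"B"}, 0, READ) for v in READ)


def U(M):
    return sum(path_min(v, (M | {"B"}) - {v}, 1, {}) for v in M)


def ancestors(v):
    """Views that have a directed path to v."""
    direct = {s for s, t, _, _ in EDGES if t == v}
    return direct.union(*(ancestors(a) for a in direct))


def inverted_tree_sets():
    """Yield every root view together with each subset of its ancestors."""
    for root in sorted(VIEWS):
        anc = sorted(ancestors(root))
        for k in range(len(anc) + 1):
            for sub in combinations(anc, k):
                yield {root, *sub}


def inverted_tree_greedy(S):
    M = set()
    while U(M) < S:
        best, C = 0.0, None
        for T in inverted_tree_sets():
            if T & M:
                continue
            eu, b = U(M | T) - U(M), tau(M) - tau(M | T)
            if eu > S or b <= 0:
                continue
            ratio = b / eu if eu > 0 else INF
            if ratio > best:
                best, C = ratio, T
        if C is None:  # added safeguard: nothing admissible is left
            break
        M |= C
    return M
```

On this assumed graph with S = 4, the first stage already selects {V2, V3, V4, V5} as a single inverted tree set, which the one-view-at-a-time greedy rule cannot do.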

We prove in Theorem 1 that the Inverted-tree greedy algorithm is guaranteed 
to deliver a near-optimal solution. In Section 7 we present experimental results 
that indicate that in practice, the Inverted-tree greedy algorithm almost always 
returns an optimal solution. We now define a notion of update graphs which is 
used to prove Lemma 1. 

Definition 8. (Update Graph) Given an OR view graph G and a set of 
nodes/views O in G, an update graph of O in G is denoted by U_O^G and is a 
subgraph of G such that V(U_O^G) = O, and E(U_O^G) = {(v, u) | u, v ∈ O and 
v (∈ O) is such that UC(u, {v}) ≤ UC(u, {w}) for all w ∈ O}. We drop the 
superscript G of U_O^G, whenever evident from context. 

It is easy to see that an update graph is an embedded forest in G. An update 
graph of O is useful in determining the flow of changes when maintaining the set 
of views O. An edge (v, u) in an update graph U_O signifies that the view u uses 
the view v (or its delta tables) to incrementally maintain itself, when the set O 
is materialized. Figure 5 shows the update graph of {V1, V2, V3, V5} in the OR 
view graph 𝒢 of our running example in Figure 3. 

Fig. 5. The update-graph for {V1, V2, V3, V5} in 𝒢 



Lemma 1 For a given set of views M, a set of views O in an OR view graph G 
can be partitioned into inverted tree sets O_1, O_2, ..., O_m, such that 

$$\sum_{i=1}^{m} EU(O_i, M) \leq EU(O, M).$$

Proof. Consider the update graph U_O of O in G. By definition, U_O is a forest 
consisting of m trees, say, U_1, ..., U_m for some m ≤ |O|. Let O_i = V(U_i), 
for i ≤ m. 

An edge (y, x) in the update graph U_O implies presence of an edge (x, y) in 
the transitive closure of G. Thus, an embedded tree U_i in the update graph U_O 
is an embedded tree in the transitive closure of the inverse graph of G. Hence, 
the set of vertices O_i is an inverted tree set in G. 
For a set of views C, we use UC(C, M) to denote the maintenance cost of 
the set C w.r.t. M, i.e., UC(C, M) = Σ_{v ∈ C} g_v UC(v, M ∪ C), where 
UC(v, M) for a view v is as defined earlier. Now, the effective 
maintenance-cost of a set O_i with respect to a set M can be written as 

$$EU(O_i, M) = \bigl(UC(O_i, M) + UC(M, O_i)\bigr) - U(M) = UC(O_i, M) - \bigl(U(M) - UC(M, O_i)\bigr) = UC(O_i, M) - Rd(M, O_i),$$

where Rd(M, C) is used to denote the reduction in the maintenance 
time of M due to the set of views C, i.e., Rd(M, C) = U(M) − UC(M, C). 

As no view in a set O_i uses a view in a different set O_j for its maintenance, 
UC(O, M) = Σ_{i=1}^{m} UC(O_i, M). Also, the reduction in the maintenance cost 
of M due to the set O is less than the sum of the reductions due to the sets 
O_1, ..., O_m, i.e., Rd(M, O) ≤ Σ_{i=1}^{m} Rd(M, O_i). 

Therefore, as EU(O, M) = UC(O, M) − Rd(M, O), we have 
EU(O, M) ≥ Σ_{i=1}^{m} EU(O_i, M). □ 

Theorem 1 Given an OR view graph G and a total maintenance-time constraint S, 
the Inverted-tree greedy algorithm (Algorithm 1) returns a solution M such that 
U(M) ≤ 2S and M has a query benefit of at least (1 − 1/e) = 63% of that of an 
optimal solution that has a maintenance cost of at most U(M), under the 
assumption that the optimal solution doesn’t have an inverted tree set O_i such 
that U(O_i) > S. 

The simplifying assumption made in the above theorem is almost always 
true, because U(M) is not expected to be much higher than S. The following 
theorem proves a similar performance guarantee on the solution returned by the 
Inverted-tree greedy algorithm without the assumption used in Theorem 1. 

Theorem 2 Given an OR view graph G and a total maintenance-time constraint S, 
the Inverted-tree greedy algorithm (Algorithm 1) returns a solution M such that 
U(M) ≤ 2S and B(M, ∅)/U(M) ≥ 0.5 B(O, ∅)/S, where O is an optimal solution 
such that U(O) ≤ S. 

Dependence of Query and Update Frequencies. Note that we have not made any 
assumptions about the independence of query frequencies and update frequencies 
of views. In fact, the query frequency of a view may decrease with the 
materialization of other views. It can be shown that the above performance 
guarantees hold even when the query frequency of a view decreases with the 
materialization of other views. 

Time Complexity. Let G be an OR view graph of size n and A_v be the number 
of ancestors of a node v ∈ V(G). The number of inverted tree sets in G that are 
formed by a node v ∈ V(G) as its root is 2^{A_v}, because any set of ancestors 
of v (which become a set of descendants in the inverse graph) forms an inverted 
tree with v, and any inverted tree set that has v as its root is formed from v 
and a subset of its ancestors. Therefore, the total number of inverted tree sets 
in an OR view graph G, and also the total time complexity of a stage of the 
Inverted-tree greedy algorithm, is Σ_{v ∈ V(G)} 2^{A_v}, which is in the worst 
case exponential in n. 
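The count of inverted tree sets can be illustrated directly from the ancestor argument above. The sketch below is an illustration with hypothetical names: it takes a list of (view, computed_from) edges and sums 2 to the power of the number of ancestors over all nodes, where an ancestor of v is any node with a directed path to v.

```python
def count_inverted_tree_sets(edges):
    """Sum over nodes v of 2**A_v, where A_v is the number of ancestors of v."""
    nodes = {n for e in edges for n in e}
    parents = {n: {s for s, t in edges if t == n} for n in nodes}

    def ancestors(v, seen=None):
        # Collect every node that can reach v by following edges forward.
        seen = set() if seen is None else seen
        for p in parents[v]:
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    return sum(2 ** len(ancestors(v)) for v in nodes)
```

For a chain a → b → c the count is 1 + 2 + 4 = 7, and for three views that are all computed from a fourth it is 1 + 1 + 1 + 8 = 11; a node with k ancestors alone contributes 2^k sets, which is where the exponential worst case comes from.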
We note that for the special case of an OR view graph being a balanced binary 
tree, each stage of the Inverted-tree greedy algorithm runs in polynomial time 
O(n^2), where n is the number of nodes in the graph. The number of inverted tree 
sets, T(h), in a general balanced tree of height h can also be computed using the 
following recursion: T(h) = ((2r)^h − 1)/(r − 1) = ((n + 1)^2 − 1)/(r − 1) = 
O(n^2/r), where r > 1 is the branching factor of the tree. 

As discussed in Section 7, our experiments show that the Inverted-tree greedy 
approach takes substantially less time than the A* algorithm presented in the 
next section, especially for sparse graphs. Also, the space requirements of the 
Inverted-tree greedy algorithm are polynomial in the size of the graph, while 
those of the A* heuristic are exponential in the size of the input graph. 



6 A* Heuristic 

In this section, we present an A* heuristic that, given an AND-OR view graph 
and a quantity S, delivers a set of views M that has an optimal query response 
time such that the total maintenance cost of M is less than S. Recollect that an 
A* algorithm searches for an optimal solution in a search graph where 
each node represents a candidate solution. Roussopoulos also demonstrated 
the use of A* heuristics for selection of indexes in relational databases. 

Let G be an AND-OR view graph instance and S be the total maintenance-time 
constraint. We first number the set of views (nodes) N of the graph in 
an inverse topological order <v1, v2, ..., vn> so that all the edges (vi, vj) 
in G are such that i > j. We use this order of views to define a binary tree 
T_G of candidate feasible solutions, which is the search tree used by the A* 
algorithm to search for an optimal solution. Each node x in T_G has a label 
<N_x, M_x>, where N_x = {v1, v2, ..., vd} is a set of views that have been 
considered for possible materialization at x and M_x (⊆ N_x) is the set of 
views chosen for materialization at x. The root of T_G has the label <∅, ∅>, 
signifying an empty solution. Each node x with a label <N_x, M_x> has two 
successor nodes l(x) and r(x) with the labels <N_x ∪ {v_{d+1}}, M_x> and 
<N_x ∪ {v_{d+1}}, M_x ∪ {v_{d+1}}> respectively. The successor r(x) exists only 
if M_x ∪ {v_{d+1}} has a total maintenance cost of less than S, the given cost 
constraint. 

We define two functions g : V(T_G) → R and h : V(T_G) → R, where R 
is the set of real numbers. For a node x ∈ V(T_G), with a label <N_x, M_x>, the 
value g(x) is the total query cost of the queries on N_x using the selected 
views in M_x. That is, 

$$g(x) = \sum_{v_i \in N_x} f_{v_i} \, Q(v_i, M_x).$$

The number h(x) is an estimated lower bound on h*(x), which is defined as the 
remaining query cost of an optimal solution corresponding to some descendant of 
x in T_G. In other words, h(x) is a lower bound estimation of 
h*(x) = τ(G, M_y) − g(x), where M_y is an optimal solution corresponding to 
some descendant y of x in T_G. 

Footnote: The function g is not to be confused with the update frequency g_v 
associated with each view in a view graph. 

Algorithm 2 A* Heuristic 

Input: G, an AND-OR view graph, and S, the maintenance-cost constraint. 
Output: A set of views M selected for materialization. 

BEGIN 
    Create a tree T_G having just the root A. The label associated with A is 
        <∅, ∅>. 
    Create a priority queue (heap) L = <A>. 
    repeat 
        Remove x from L, where x has the lowest g(x) + h(x) value in L. 
        Let the label of x be <N_x, M_x>, where N_x = {v1, v2, ..., vd} for 
            some d ≤ n. 
        if (d = n) RETURN M_x. 
        Add a successor of x, l(x), with a label <N_x ∪ {v_{d+1}}, M_x> to the 
            list L. 
        if (U(M_x) < S) 
            Add to L a successor of x, r(x), with a label 
                <N_x ∪ {v_{d+1}}, M_x ∪ {v_{d+1}}>. 
    until (L is empty); 
    RETURN NULL; 
END. 

We now show how to compute the value h(x) for a node x in the binary 
tree T_G. Let N = V(G) be the set of all views/nodes in G. Given a node x, we 
need to estimate the optimal query cost of the remaining queries in N − N_x. 
Let s(v) = g_v UC(v, N), the minimum maintenance time a view v can have in 
presence of other materialized views. Also, if a node v ∈ V(G) is not selected 
for materialization, queries on v have a minimum query cost of 
p(v) = f_v Q(v, N − {v}). Hence, for each view v that is not selected in an 
optimal solution M_y containing M_x, the remaining query cost accrues by at 
least p(v). Thus, we fill up the remaining maintenance time available, 
S − U(M_x), with views in N − N_x in the order of their p(v)/s(v) values. The 
sum of the f_v Q(v, N − {v}) values for the views left out will give a lower 
bound on h*(x), the optimal query cost of the remaining queries. In order to 
nullify the knapsack effect as mentioned earlier, we start with leaving out the 
view w that has the highest f_w Q(w, N − {w}) value. 

Theorem 3 The A* algorithm (Algorithm 2) returns an optimal solution. 

The above theorem guarantees the correctness of the A* heuristic. Better lower 
bounds yield A* heuristics that will have better performance in terms of the 
number of nodes explored in T_G. In the worst case, the A* heuristic can take 
exponential time in the number of nodes in the view graph. There are no better 
bounds known for the A* algorithm in terms of the function h(x) used. 
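The search over the binary tree of labels can be sketched for the simpler OR-graph case. This is an illustration under loud assumptions, not the paper's heuristic: the example graph and reading costs are hypothetical, the base relation's own query cost is left out of g because it is constant, a view may be added when the total maintenance cost stays at or below S, and h(x) is a cruder admissible bound than the one described above (each unconsidered view is charged its query cost when every view is materialized).

```python
import heapq
from itertools import count

# Assumed OR view graph: (view, computed_from, query_cost, maintenance_cost).
EDGES = [("V1", "B", 0, 4), ("V2", "B", 0, 4),
         ("V3", "V2", 0, 0), ("V4", "V2", 0, 0), ("V5", "V2", 0, 0)]
READ = {"B": 12, "V1": 7, "V2": 11, "V3": 8, "V4": 8, "V5": 8}  # V2..V5 assumed
ORDER = ["V1", "V2", "V3", "V4", "V5"]  # inverse topological order of the views
INF = float("inf")


def path_min(v, targets, idx, read):
    """Cheapest path from v to a target in a DAG; idx 0 = query, 1 = maintenance."""
    here = read.get(v, 0) if v in targets else INF
    onward = [e[2 + idx] + path_min(e[1], targets, idx, read)
              for e in EDGES if e[0] == v]
    return min([here] + onward)


def Q(v, M):
    return path_min(v, set(M) | {"B"}, 0, READ)


def U(M):
    return sum(path_min(v, (set(M) | {"B"}) - {v}, 1, {}) for v in M)


def a_star(S):
    n = len(ORDER)
    floor = [Q(v, ORDER) for v in ORDER]  # cheapest possible query cost per view
    tie = count()                         # tie-breaker so sets are never compared
    heap = [(sum(floor), next(tie), 0, frozenset())]
    while heap:
        _, _, d, M = heapq.heappop(heap)
        if d == n:
            return set(M)                 # first complete label popped is optimal
        for M_next in (M, M | {ORDER[d]}):   # l(x), then r(x)
            if U(M_next) > S:
                continue
            g = sum(Q(v, M_next) for v in ORDER[:d + 1])
            h = sum(floor[d + 1:])
            heapq.heappush(heap, (g + h, next(tie), d + 1, M_next))
    return None
```

Because views are considered in inverse topological order, the query cost of an already considered view cannot change later, so g is exact and the first complete label removed from the heap is an optimal feasible set for this sketch.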
7 Experimental Results 

We ran some experiments to determine the quality of the solution delivered and 
the time taken in practice by the Inverted-tree Greedy algorithm for OR view 
graphs that arise in practice. We implemented both the algorithms, 
Inverted-tree Greedy and A* heuristic, and ran them on random instances of OR 
view graphs that are balanced trees and directed acyclic OR view graphs with 
varying edge-densities, and random query and update frequencies. We used a 
linear cost model [HRU96] for the purposes of our experiments. 

We made the following observations. The Inverted-tree Greedy Algorithm 
(Algorithm 1) returned an optimal solution for almost all (96%) view graph 
instances. In other cases, the solution returned by the Inverted-tree greedy 
algorithm had a query benefit of around 95% of the optimal query benefit. For 
balanced trees and sparse graphs having edge density less than 40%, the 
Inverted-tree greedy took substantially less time (a factor of 10 to 500) than 
that taken by the A* heuristic. With the increase in the edge density, the 
benefit of Inverted-tree greedy over the A* heuristic reduces, and for very 
dense graphs, A* may actually perform marginally better than the Inverted-tree 
greedy. One should observe that OR view graphs that are expected to arise in 
practice would be very sparse. For example, the OR view graph corresponding to 
a data cube having n dimensions has 
$\sum_{i=1}^{n} \binom{n}{i} 2^{i} = 3^{n} - 1$ edges and 2^n vertices. Thus, 
the edge density is approximately (0.75)^n, for a given n. 
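The density estimate follows from the binomial theorem. As a worked check, assuming the edge count is the sum $\sum_{i=1}^{n}\binom{n}{i}2^{i}$ and that density is measured against the $(2^{n})^{2}$ ordered pairs of vertices (both are assumptions about the intended reading): 

$$\sum_{i=1}^{n}\binom{n}{i}2^{i} = (1+2)^{n} - 1 = 3^{n} - 1, \qquad \frac{3^{n}-1}{(2^{n})^{2}} = \left(\tfrac{3}{4}\right)^{n} - 4^{-n} \approx (0.75)^{n}.$$

For n = 10, for instance, this gives 59048 edges over 1024 vertices, a density of about 5.6%. 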

The comparison of the times taken by the Inverted-tree greedy and the A* 
heuristic is briefly presented in Figure 6. In all the plots shown in Figure 6, 
the different view graph instances of the maintenance-cost view-selection 
problem are plotted on the x-axis. A view graph instance G is represented in 
terms of N, the number of nodes in G, and S, the maintenance-time constraint. 
The view graph instances are arranged in the lexicographic order of (N, S), 
i.e., all the view graphs with smallest N are listed first, in order of their 
constraint S values. In all the graph plots, the number N varied from 10 to 25, 
and S varied from the time required to maintain the smallest view to the time 
required to maintain all views in a given view graph. 

8 Conclusions 

One of the most important decisions in design of a data warehouse is the se- 
lection of views to materialize. The view-selection problem in a data warehouse 
is to select a set of views to materialize so as to optimize the total query re- 
sponse time, under some resource constraint such as total space and/or the total 
maintenance time of the materialized views. All the prior work done on the view 
selection problem considered only a disk-space constraint. In practice, the real 
constraining factor is the total maintenance time. Hence, in this article, we have 
considered the maintenance-cost view-selection problem where the constraint is 
of total maintenance time. 

As the maintenance-cost view-selection problem is intractable, we designed 
an approximation algorithm for the special case of OR view graphs. The OR 
view graphs arise in many practical applications like data cubes, and other OLAP 
applications. For the general case of AND-OR view graphs, we designed an A* 
heuristic that delivers an optimal solution. 




[Figure 6 plots not reproduced. Recoverable panel titles: “Performance on 
Graphs with 10% Edge Density”; “Performance Ratios (Time taken by A* / Time 
taken by Inverted-tree Greedy), Random Graphs”. The x-axis of each panel is 
(N, S).] 

Fig. 6. Experimental Results. The x-axis shows the view graph instances in 
lexicographic order of their (N, S) values, where N is the number of nodes in 
the graph and S is the maintenance-time constraint. 
Our preliminary experimental results are very encouraging for the 
Inverted-tree Greedy algorithm. Also, the space requirement of the 
Inverted-tree greedy heuristic is polynomial in the size of the graph, while 
that of the A* heuristic grows exponentially in the size of the graph. 
Abstract. Electronic newsgroups are one of the primary means for the 
dissemination, exchange and sharing of information. We argue that the 
current newsgroup model is unsatisfactory, especially when posted arti- 
cles are relevant to multiple newsgroups. We demonstrate that consid- 
erable additional flexibility can be achieved by managing newsgroups in 
a data warehouse, where each article is a tuple of attribute-value pairs, 
and each newsgroup is a view on the set of all posted articles. Support- 
ing this paradigm for a large set of newsgroups makes it imperative to 
efficiently support a very large number of views: this is the key difference 
between newsgroup data warehouses and conventional data warehouses. 
We identify two complementary problems concerning the design of such 
a newsgroup data warehouse. An important design decision that the sys- 
tem needs to make is which newsgroup views to eagerly maintain (i.e., 
materialize). We demonstrate the intractability of the general newsgroup- 
selection problem, consider various natural special cases of the problem, 
and present efficient exact/approximation algorithms and complexity 
hardness results for them. A second important task concerns the effi- 
cient incremental maintenance of the eagerly maintained newsgroups. 
The newsgroup-maintenance problem for our model of newsgroup defi- 
nitions is a more general version of the classical point-location problem, 
and we design an I/O and CPU efficient algorithm for this problem. 



1 Introduction 

Electronic newsgroups/discussion groups are one of the primary means for the 
dissemination, exchange and sharing of information on a wide variety of topics. 
For example, comp.databases and comp.lang.c typically contain articles 
relevant to computers and computation, while soc.culture.indian and 
soc.culture.mexican typically contain articles relevant to some world cultures. 

In the current model of posting articles to electronic newsgroups, it is the 
responsibility of the author of an article to determine the newsgroups that are 
relevant to the article. Placing such a burden on the author of the article has 
many undesirable consequences, especially since there are thousands of 
newsgroups currently, and the author of an article may not always know all the 
relevant newsgroups: (a) articles are often cross-posted to irrelevant 
newsgroups, resulting in flame wars, and (b) articles that are obviously 
relevant to multiple newsgroups may not be posted to all of them, missing 
potentially relevant readers. This unsatisfactory situation will only get worse 
as the number of newsgroups increases. 

* Supported by NSF grant IRI-96-31952. 

In this paper, we present a novel model for managing electronic newsgroups 
that does not suffer from the above mentioned problems; we refer to it as the 
Data Warehouse of Newsgroups (DaWN) model. In the DaWN model, the au- 
thor of an article “posts” the article to the newsgroup management system, not 
to any specific newsgroups. Each newsgroup is defined as a view over the set of all 
articles posted to the newsgroup management system, and it is the responsibility 
of the system to determine all the newsgroups into which a new article must be 
inserted. Our newsgroup views allow the flexible combination of selection con- 
ditions on structured attributes of the article, such as its author and posting 
date, along with selection conditions on unstructured attributes of the article, 
such as its subject and body. For example, the newsgroup soc.culture.indian 
may be defined to contain all articles posted in the current year, each of whose 
bodies is similar to at least one of a set of, say, 100 articles that have been 
manually determined to be representative of this newsgroup. Similarly, the def- 
inition of the newsgroup att.forsale may require that the organization of the 
article’s author be AT&T. The ability to automatically classify posted articles 
into newsgroups based on conditions satisfied by multiple structured and un- 
structured attributes of the article permits very flexible newsgroup definitions. 
Clearly, the DaWN model has the potential of bringing together the author and 
targeted readers of an article. 

In this paper, we identify and address two complementary problems that 
arise when newsgroups are defined as views in the DaWN model. 

Newsgroup-selection problem: Given a large set of newsgroups, which of 
these newsgroups should be eagerly maintained (materialized), and which 
of them should be lazily maintained, in order to conserve system resources 
while allowing efficient access to the various newsgroups? 
We demonstrate that the general newsgroup-selection problem is intractable, 
which motivates a study of special cases that arise in practice. We consider 
many natural special cases, based on the DaWN model of articles and 
newsgroups, and present efficient exact/approximation algorithms for them. 

Newsgroup-maintenance problem: Given a new article, in which of the 
possibly large number of materialized newsgroups must the article be inserted? 
We expect any article to be contained in very few newsgroups, while the 
number of newsgroups supported by the data warehouse may be very large. 
Thus, an algorithm that iteratively checks whether the new article needs to 
be inserted in each of the newsgroups would be extremely inefficient. We 
devise an I/O and CPU efficient solution for the newsgroup-maintenance 
problem. 

Both the above problems arise in any data warehouse that supports materi- 
alized views. What distinguishes DaWN from conventional data warehouses are 
the following characteristics: (i) the extremely large number of views defined in 
DaWN, and (ii) the simple form of individual newsgroup views as selections over 
the set of all posted articles. While we focus on the data warehouse of newsgroups 
in this paper, our techniques and solutions are more generally applicable to any 
data warehouse or multi-dimensional database with the above characteristics, 
such as data warehouses of scientific articles, legal resolutions, and corporate 
e-mail repositories. 

The rest of this paper is organized as follows. In the next section, we briefly 
describe the DaWN model. We then consider the problem of efficiently 
maintaining materialized newsgroup views, when new articles arrive into the 
newsgroup system. After that, we discuss the problem of selecting an 
appropriate set of newsgroup views to eagerly maintain. We present related work 
for each of the two problems in their respective sections. We end with 
concluding remarks. 

2 The DaWN Model 

A data warehouse is a large repository of information available for querying and 
analysis [IK93, HGMW+95, Wid95]. It consists of a set of materialized views over 
information sources of interest, using which a family of (anticipated) user queries 
can be answered efficiently. In this paper, we show the advantages of modeling 
the newsgroup management system as a data warehouse. 



2.1 Article Store: The Information Source 

A newsgroup article contains information of two types: header fields, and a body. 
The header fields of an article are each identified by a keyword and a value. In 
contrast, the body is viewed as unstructured text. For the purpose of this paper, 
we model articles as having d attributes, A1, A2, ..., Ad. For header fields of 
the article, the keyword is the name of the attribute; the body of the article 
can be treated as the value of an attribute called Body. The various examples 
in this paper use the commonly specified attributes From, Organization, Date, 
Subject, and Body of newsgroup articles, with their obvious meanings. 

The article store is the set of all posted articles. To facilitate answering 
queries over the article store, various indexes can be built over the article at- 
tributes. For example, the article store may maintain an inverted index structure 
[Fal85] on the Body attribute, and a B-tree on the Date attribute. The article 
store along with the index structures is the information source for the data 
warehouse of newsgroups. 



2.2 Newsgroup Views 

An electronic newsgroup contains a set of articles. Current-day newsgroup
management systems support only newsgroups that contain articles that have been
explicitly posted to those newsgroups. Here, we focus our attention on newsgroups
that are defined as views over the set of all articles stored in the underlying
article store. The articles in the newsgroups are determined automatically
by the newsgroup management system based on the newsgroup definitions, and
are not explicitly posted to the newsgroups by the authors. Conventional
newsgroups can also co-exist with such automatically populated newsgroups.

In this paper, we consider newsgroups that are defined as selection views on
the article attributes. The atomic conditions that are the basis of the newsgroup
definitions are of the forms: (a) “attribute similar-to typical-article-body
with-threshold threshold-value”, (b) “attribute contains value”, and (c) “attribute
$\{\le, \ge, =, \ne, <, >\}$ value”.

Given an article attribute $A_i$, an attribute selection condition on $A_i$ is an
arbitrary boolean expression of atomic conditions on $A_i$. A newsgroup-view
definition is a conjunction of attribute selection conditions on the article attributes,
i.e., we consider newsgroups $V$ defined using selection conditions of the form

$$V = \bigwedge_{j \in I} f_j(A_j),$$

where $I \subseteq \{1, 2, \ldots, d\}$ is known as the index set of newsgroup $V$ and $f_j(A_j)$ is
an attribute selection condition on attribute $A_j$. We expect the size $|I|$ of the index
set of typical newsgroups that are defined as views to be small compared to
the total number of article attributes.

For example, the newsgroup att.forsale may be defined as “($\wedge$ (Date > 1
Jan 1998) (Organization = AT&T) (Subject contains Sale))”. A more interesting
example defines the newsgroup soc.culture.indian as “($\wedge$ (Date > 1
Jan 1998) ($\vee$ (Body similar-to $B_1$ with-threshold $T_1$) . . . (Body similar-to $B_{100}$
with-threshold $T_{100}$)))”, where the various $B_i$'s are the bodies of typical-articles
that are representative of the newsgroup, and the $T_i$'s are the desired cosine
similarity match threshold values [SB88]. Both these examples combine the use
of conditions on structured and text-valued unstructured attributes.
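
To make the notion of a newsgroup view concrete, the following minimal Python sketch encodes the att.forsale definition as a conjunction of per-attribute conditions. The dictionary encoding, the function name `matches`, the sample article, and the case-insensitive reading of “contains” are all assumptions made for illustration; they are not part of DaWN.

```python
from datetime import date

def matches(view, article):
    """True if the article satisfies every attribute selection condition."""
    # A view is a conjunction; attributes outside its index set are ignored.
    return all(cond(article.get(attr)) for attr, cond in view.items())

# Hypothetical encoding of att.forsale; its index set is
# {Date, Organization, Subject}.
att_forsale = {
    "Date": lambda v: v is not None and v > date(1998, 1, 1),
    "Organization": lambda v: v == "AT&T",
    # Assumption: "contains" is treated as a case-insensitive substring test.
    "Subject": lambda v: v is not None and "sale" in v.lower(),
}

article = {
    "From": "alice@example.com",
    "Organization": "AT&T",
    "Date": date(1998, 3, 14),
    "Subject": "Bike for sale",
    "Body": "Lightly used road bike.",
}

print(matches(att_forsale, article))
```

Only the three attributes in the index set are inspected, which is the property the maintenance algorithm of Section 3 exploits.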

2.3 The Data Warehouse and Design Decisions 

The data warehouse of newsgroups, defined over the article store, allows users to 
request the set of all articles in any specific newsgroup; this request is referred to 
as a newsgroup query and is the only type of user query supported by the data 
warehouse. The newsgroup management system may decide to eagerly maintain 
(materialize) some of the newsgroups; these newsgroups are the materialized 
views in DaWN, and are kept up to date in response to additions to the article 
store. Newsgroup queries can be answered using the materialized views stored 
in the data warehouse and/or the article store. We use the term newsgroup
management system to refer to the system consisting of the data warehouse
and the article store.

Two of the most important decisions in designing DaWN, the data warehouse
of newsgroups, are the following: (a) the selection of materialized views to be
stored at the warehouse, for a given family of (anticipated) user queries; such a
selection is important given the limited amount of resources such as storage space
and/or total view maintenance time, and (b) the efficient incremental maintenance
of the materialized views, for a given family of information source data
updates. The warehouse may maintain various indexes for efficient maintenance
of the materialized newsgroups. We further discuss the nature and use of these
index structures in Section 3.

In the following sections, we address the above two key issues in the design of 
DaWN, taking into account the special characteristics that distinguish it from 
conventional data warehouses: (i) the extremely large number of newsgroup views 
defined in DaWN, and (ii) the simple form of individual newsgroup views as 
selections over the set of all posted articles. 

3 The Newsgroup-Maintenance Problem 

In this section, we consider the newsgroup-maintenance problem, i.e., the prob- 
lem of efficiently updating a large set of materialized newsgroups, in response to 
new articles being added to the article store. We start with a precise definition 
of our problem, and present some related work, before describing our I/O and 
CPU efficient solution for the problem. 



3.1 The Problem Definition 

We formally define the newsgroup-maintenance problem as follows. Let $V_1, \ldots, V_n$
be the large set of materialized newsgroups in the newsgroup system. Given
a new article $m = (b_1, b_2, \ldots, b_d)$, we wish to output the subset of newsgroups
that are affected by the posting of the article $m$ to the article store, i.e., the set
of newsgroups in which $m$ needs to be inserted.

Given the large number of newsgroups in a typical newsgroup data ware- 
house, the brute force method of sequentially checking each of the newsgroup 
definitions to determine if the article needs to be inserted into the newsgroup 
could be very inefficient. The key challenge here is to devise a solution that takes 
advantage of the specific nature of our problem, wherein the size of the index 
set of newsgroups is small, compared to the number of attributes of the article. 

3.2 Related Work 

The work that is most closely related to our problem of newsgroup-maintenance
is that on the classical point-location problem. The point-location problem is to
report all hyper-rectangles from a given set that contain a given query point. An
equivalent problem that has been examined by the database community is that
of the predicate matching problem in active databases and forward chaining rule
systems [HCKW90]. However, our problem is more general because it
combines conditions on ordered domains with conditions on unstructured text
attributes. Also, each attribute selection condition on an ordered domain may
be a general boolean expression of atomic conditions, for example, a union of
interval ranges.

There has been a considerable amount of work on the point-location problem 
pt^VliS IIH^deSdalh^deiS.'-ihK Ta.S.'-i] on designing optimal main-memory algorithms. 
However, there hasn’t been any work reported on designing secondary memory 
algorithms for the point-location problem, that have optimal worst-case bounds. 
In contrast, there has been some recent work [IK HV V t),'-iimst)5IV V t)6) reported 
for the dual problem of range-searching. The secondary memory data struc- 
tures developed for the point-location problem like various R-trees, cell-trees, 
hB-trees have good average-case behavior for common spatial database prob- 
lems, but do not have any theoretical worst-case bounds. We refer the reader 
to [Sam89a, Sam89b] for a survey. We note that none of the previously proposed
algorithms can take advantage of small index sets of newsgroup views, i.e., all 
of them treat unspecified selection conditions as intervals covering the entire 
dimension. 



3.3 Independent Search Trees Algorithm 

In this subsection, we present our I/O and CPU efficient approach called the In- 
dependent Search Trees Algorithm for solving the newsgroup-maintenance prob- 
lem. For ease of understanding, we start with a description of the algorithm for 
newsgroup definitions where each attribute is an ordered domain, and each se- 
lection condition is an atomic condition. Thus, each attribute value is restricted 
to be in a single interval, i.e., each newsgroup is a hyper-rectangle. Later, we 
will extend it to handling unstructured text attributes and general boolean ex- 
pressions in the attribute selection conditions. 

Newsgroup-Maintenance of Hyper-Rectangles: Consider $n$ newsgroups $V_1, \ldots, V_n$,
where each newsgroup $V_i$ is defined as

$$V_i = \bigwedge_{j \in I_i} f_{ij}$$

and each $f_{ij}$ is of the form $(A_j \in c_{ij})$ for some interval $c_{ij}$ on the respective
ordered domain $D_j$. The data structure we use consists of $d$ external segment
tree structures [RS94] $T_1, T_2, \ldots, T_d$, such that tree $T_j$ stores the intervals
$\{c_{ij} \mid j \in I_i, 1 \le i \le n\}$.

We compute the set of affected newsgroups (views) as follows. We keep an
array $I$ of size $n$, where $I[i]$ is initialized to $|I_i|$, the size of the index set of
newsgroup $V_i$. When an article $m = (b_1, b_2, \ldots, b_d)$ arrives, we search for intervals
in the external segment trees $T_j$ that contain $b_j$, for all $1 \le j \le d$. While
searching in the segment tree $T_j$ for $b_j$, when an interval $c_{ij}$ gets hit (which
happens when $b_j \in c_{ij}$), the entry $I[i]$ is decremented by 1. If an entry $I[i]$ drops
to zero, then the corresponding newsgroup $V_i$ is reported as one of the affected
newsgroups. These are precisely the newsgroups in which the article $m$ will have
to be inserted.
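
The counting scheme can be illustrated with a small in-memory Python sketch. It is only an approximation of the algorithm: a plain per-attribute list with a linear scan stands in for each external segment tree $T_j$, attributes are numbered from 0 rather than 1, and the names `build_index` and `affected_views` are hypothetical.

```python
def build_index(views, d):
    """views[i] maps attribute index j -> closed interval (lo, hi).

    Returns one interval list per attribute; each list is a stand-in
    for the external segment tree T_j of the text."""
    trees = [[] for _ in range(d)]
    for i, view in enumerate(views):
        for j, (lo, hi) in view.items():
            trees[j].append((lo, hi, i))
    return trees

def affected_views(views, trees, article):
    """Return the ids of views whose every interval contains the article."""
    counters = [len(view) for view in views]      # I[i] = |I_i|
    hits = []
    for j, b in enumerate(article):
        for lo, hi, i in trees[j]:                # linear scan, not a stabbing query
            if lo <= b <= hi:
                counters[i] -= 1
                if counters[i] == 0:              # all conditions of V_i satisfied
                    hits.append(i)
    return sorted(hits)

views = [{0: (1, 5)}, {0: (3, 8), 1: (0, 2)}, {1: (5, 9)}]
trees = build_index(views, d=2)
print(affected_views(views, trees, (4, 1)))
print(affected_views(views, trees, (4, 7)))
```

Note that a view touches only the lists of the attributes in its index set, so unspecified conditions cost nothing at query time.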

Handling the “contains” operator: We now extend our Independent Search Trees
Algorithm to handle the “contains” operator for unstructured text-valued
attributes such as Subject. Thus, the newsgroup definitions may now use the
contains operator, i.e., $f_{ij}$ can be ($A_j$ contains $s_{ij}$) for some string $s_{ij}$ and a
text-valued attribute $A_j$. To incorporate the contains operator in our newsgroup
definitions, we use trie data structures [Fre60] instead of segment trees for
text-valued attributes. The question we wish to answer is the following. Given a set of
strings $S_j = \{s_{ij} \mid j \in I_i\}$ (from the newsgroup definitions) and a query string $b_j$
(the value of the article attribute $A_j$), output the set of strings $s_{i_1 j}, \ldots, s_{i_l j}$
such that $b_j$ contains $s_{i_p j}$ for all $p \le l$. The matching algorithm used in
conjunction with the trie data structure can be easily modified to answer the above
problem. We build a trie on the set $S_j$ of data strings. On a query $b_j$, we search
the trie data structure for superstring matches for each suffix of $b_j$, i.e., for the
data strings that are prefixes of that suffix. The search
can be completed in $(|b_j|^2 + l)$ character comparisons, where $|b_j|$ is the size of the
query string $b_j$ and $l$ is the number of strings reported. The space requirement
of the trie data structure is $k|\Sigma|$ characters for storing $k$ strings, where $|\Sigma|$ is
the size of the alphabet. Note that the trie lends itself to an efficient secondary
memory implementation, as it is just a special form of a B-tree.
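
As a rough illustration of the suffix-by-suffix trie search, the sketch below builds an in-memory trie from nested dictionaries and reports every stored string that occurs in a query string. The names `build_trie`, `contained_patterns`, and the `END` sentinel are hypothetical, matching is case-sensitive by assumption, and no attempt is made at a secondary-memory layout.

```python
END = "<end>"  # multi-character sentinel key; cannot collide with a single character

def build_trie(patterns):
    """Insert each data string; terminal nodes record the string ids."""
    root = {}
    for idx, pattern in enumerate(patterns):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node.setdefault(END, []).append(idx)
    return root

def contained_patterns(root, text):
    """Ids of all stored strings that occur somewhere in text."""
    found = set()
    for start in range(len(text)):           # one trie walk per suffix of text
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:                 # no stored string continues this way
                break
            found.update(node.get(END, ()))  # a stored string ends here
    return found

root = build_trie(["Sale", "computer", "compute"])
print(contained_patterns(root, "Used computer for Sale"))
```

Each of the at most $|b_j|$ walks inspects at most $|b_j|$ characters, which is where the quadratic term in the comparison count comes from.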

Handling Selection Conditions on Body Attribute: We extend our techniques to
general newsgroup definitions that may also use similar-to with-threshold
predicates on the Body attribute, say $A_d$. In particular, for a view $V_i$, we consider

$$f_{id}(A_d) = \bigvee_k C(A_d, B_{ik}, T_{ik}),$$

where $C(A_d, B_{ik}, T_{ik})$ is used to represent the predicate ($A_d$ similar-to $B_{ik}$
with-threshold $T_{ik}$). Each $B_{ik}$ here is the body of a typical-article that is representative
of the $V_i$ newsgroup.

To handle maintenance of such newsgroup definitions, we build inverted lists
$L_1, L_2, \ldots, L_p$, where $p$ is the size of the dictionary (set of relevant words). Each
typical-article body $B_{ik}$ is represented by a similarity-vector $B_{ik} = (w_{ik1}, \ldots,
w_{ikp})$, where $w_{ikl}$ is the weight of the $l$-th word in $B_{ik}$. Let the set of all distinct
similarity-vectors used in the view definitions be $W_1, W_2, \ldots, W_m$. An inverted
list $L_l$ keeps a list of all similarity-vectors $W_j$ that have a non-zero weight
for the $l$-th word. Also, with each similarity-vector $W_j$, we keep a list, $R_j$, of all
view definitions that use the similarity-vector $W_j$ along with the corresponding
threshold. In other words, $R_j = \{(i, T_{ik}) \mid W_j = B_{ik}\}$, stored as an ordered list
in increasing order of the thresholds. We also keep a dot-product value $P_j$
(initialized to zero) with each $W_j$.

Let $m = (b_1, b_2, \ldots, b_d)$ be a new article posted to the article store, whose
Body attribute value $b_d = (w_{m1}, w_{m2}, \ldots, w_{mp})$ is the word-vector representing
the article body. Each $b_j$ for $j < d$ is searched in the external segment tree or
trie $T_j$, as before, with entries in $I$ decremented appropriately. To compute $f_{id}$,
we sequentially scan the inverted list $L_l$ for each non-zero value $w_{ml}$ in the new
article's word-vector $b_d$. For each $W_j$ in $L_l$, we increment the dot-product value
$P_j$ associated with $W_j$ by $(w_{jl} \cdot w_{ml}) / (|b_d|\,|W_j|)$. After all required inverted lists
have been scanned, for each $W_j$, we scan its list $R_j$ and for each $(i, T_{ik}) \in R_j$
such that $T_{ik} \le P_j$, we decrement the value of $I[i]$ by 1, making sure that each
$I[i]$ is decremented at most once.
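
The inverted-list computation can be sketched in Python as follows. This is an in-memory illustration under several assumptions: word vectors are sparse dictionaries, vector norms are recomputed on the fly rather than stored, a view is treated as satisfied when its threshold does not exceed the accumulated cosine value, and the names `build_inverted_lists` and `satisfied_views` are hypothetical.

```python
import math
from collections import defaultdict

def norm(vec):
    return math.sqrt(sum(w * w for w in vec.values()))

def build_inverted_lists(vectors):
    """vectors[j] is the sparse similarity-vector W_j (word -> weight)."""
    lists = defaultdict(list)
    for j, vec in enumerate(vectors):
        for word, weight in vec.items():
            if weight != 0:
                lists[word].append(j)
    return lists

def satisfied_views(vectors, lists, R, body):
    """R[j] lists (view id, threshold) pairs in increasing threshold order.

    Returns the set of views whose Body condition holds for `body`."""
    P = defaultdict(float)                     # accumulated cosine per W_j
    body_norm = norm(body)
    norms = [norm(v) for v in vectors]
    for word, w_m in body.items():
        if w_m == 0:
            continue
        for j in lists.get(word, ()):          # scan the inverted list of this word
            P[j] += vectors[j][word] * w_m / (body_norm * norms[j])
    satisfied = set()                          # a set, so each view counts once
    for j, score in P.items():
        for view, threshold in R[j]:
            if threshold > score:              # sorted order allows early exit
                break
            satisfied.add(view)
    return satisfied

vectors = [{"curry": 1.0, "cricket": 1.0}, {"database": 1.0}]
lists = build_inverted_lists(vectors)
R = [[(0, 0.5), (1, 0.9)], [(2, 0.1)]]
print(satisfied_views(vectors, lists, R, {"cricket": 2.0, "news": 1.0}))
```

Collecting satisfied views in a set mirrors the requirement that each $I[i]$ be decremented at most once for the Body attribute.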

Newsgroup-Maintenance of Boolean Expressions: When $f_{ij}$, the attribute selection
condition involving an ordered attribute $A_j$, is $A_j \notin d_{ij}$ for some interval
$d_{ij}$ on domain $D_j$, we still store $d_{ij}$ in the segment tree $T_j$. But, whenever $d_{ij}$ is
hit (which happens when $b_j \in d_{ij}$), we increase $I[i]$ to $d$ (instead of decrementing
by one), guaranteeing that newsgroup $V_i$ is not output. Similarly, we handle
$f_{ij}$'s of the form $\neg(A_j \text{ contains } s_{ij})$ or $\neg C(A_j, B_{ik}, T_{ik})$ for an unstructured
text-valued attribute $A_j$. Also, in an entry $I[i]$ of array $I$, we store the size of the
positive index set of $V_i$ instead of the size of $V_i$'s index set. The positive index
set $I_i^+$ of a view $V_i$ is defined as
$\{j \mid (j \in I_i) \wedge (f_{ij} \text{ is of the form } (A_j \in c_{ij}) \text{ or } (A_j \text{ contains } s_{ij}) \text{ or } C(A_j, B_{ik}, T_{ik}))\}$.
Similarly, the negative index set $I_i^-$ is defined as
$\{j \mid (j \in I_i) \wedge (f_{ij} \text{ is of the form } (A_j \notin d_{ij}) \text{ or } \neg(A_j \text{ contains } s_{ij}) \text{ or } \neg C(A_j, B_{ik}, T_{ik}))\}$.
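
A small Python sketch may help to see how negated interval conditions interact with the counters. It again uses linear scans in place of segment trees and 0-based attribute numbers; the function name `affected_views_signed` and the `(lo, hi, negated)` encoding are hypothetical. Unlike the description above, the sketch reports views only after all attributes have been processed, which is a simplification chosen so that a late negative hit can still suppress a view.

```python
def affected_views_signed(views, article):
    """views[i] maps attribute index j -> (lo, hi, negated).

    A negated entry encodes the condition A_j not in [lo, hi]."""
    d = len(article)
    lists = [[] for _ in range(d)]           # stand-ins for the trees T_j
    for i, view in enumerate(views):
        for j, (lo, hi, negated) in view.items():
            lists[j].append((lo, hi, negated, i))
    # I[i] starts at the size of the positive index set of V_i.
    counters = [sum(1 for c in view.values() if not c[2]) for view in views]
    for j, b in enumerate(article):
        for lo, hi, negated, i in lists[j]:
            if lo <= b <= hi:
                if negated:
                    counters[i] = d          # can no longer reach zero
                else:
                    counters[i] -= 1
    return [i for i, c in enumerate(counters) if c == 0]

views = [{0: (1, 5, False), 1: (0, 2, True)}, {0: (1, 5, False)}]
print(affected_views_signed(views, (3, 1)))
print(affected_views_signed(views, (3, 7)))
```

Because a view with a negated condition has at most $d - 1$ positive conditions, a counter raised to $d$ stays positive no matter how many positive hits follow.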

The generalization to arbitrary boolean expressions for ordered domain
attributes is achieved as follows. An arbitrary boolean expression $f_{ij}$ for an arithmetic
attribute $A_j$ can be represented as $\bigvee_k (A_j \in c_{ijk})$ or as $\bigwedge_k (A_j \notin d_{ijk})$,
for some set of intervals $c_{ijk}$ or $d_{ijk}$ on $D_j$. A segment tree $T_j$, corresponding
to the attribute $A_j$, is constructed to include all the intervals $c_{ijk}$ or $d_{ijk}$,
and the corresponding entries in $I$ are decreased by 1, or increased to $d$, on hits to
the intervals as appropriate. If $A_j$ is an unstructured text-valued attribute, we can
easily handle boolean expressions of the type $f_{ij} = \bigvee_k (A_j \text{ contains } s_{ijk})$
or $f_{ij} = \bigwedge_k \neg(A_j \text{ contains } s_{ijk})$ for a set of strings $s_{ijk}$. For the Body
attribute $A_d$, we can handle expressions of the type $f_{id} = \bigvee_k C(A_d, B_{ik}, T_{ik})$
or $f_{id} = \bigwedge_k \neg C(A_d, B_{ik}, T_{ik})$. Arbitrary boolean expressions in conjunctive
normal form (CNF) for unstructured text attributes using contains or similar-to
with-threshold predicates can also be handled by creating a duplicate attribute for
each clause in the CNF boolean expression. We note that $\bigvee_k C(A_d, B_{ik}, T_{ik})$ is
the only type of boolean expression involving the Body attribute that we expect
in practice.

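For general boolean expressions, the definitions of $I_i^+$ and $I_i^-$ are appropriately
extended to
$\{j \mid (j \in I_i) \wedge (f_{ij} \text{ is of the form } \bigvee_k (A_j \in c_{ijk}) \text{ or } \bigvee_k (A_j \text{ contains } s_{ijk}) \text{ or } \bigvee_k C(A_d, B_{ik}, T_{ik}))\}$
and
$\{j \mid (j \in I_i) \wedge (f_{ij} \text{ is of the form } \bigwedge_k (A_j \notin d_{ijk}) \text{ or } \bigwedge_k \neg(A_j \text{ contains } s_{ijk}) \text{ or } \bigwedge_k \neg C(A_d, B_{ik}, T_{ik}))\}$,
respectively.^1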

Handling Unspecified Article Attributes: An article $m$ is inserted into a view $V_i$
based on the values of only those article attributes that belong to $V_i$'s index
set $I_i$. However, an attribute selection condition $f_{ij}$ can be defined to accept
or reject an unspecified attribute value. For example, it is reasonable for the
selection condition (Organization = “AT&T”) to reject articles that have the
attribute Organization unspecified, while an unspecified attribute value should
probably pass the selection condition Organization $\ne$ “AOL”.

^1 We assume that $|I_i^+| > 0$ for each $i$. Else, we will need to keep a list of views with
zero $|I_i^+|$ and report them if the entry $I[i] = |I_i^+|$ remains unchanged.

To handle such cases of unspecified attribute values in articles, we maintain
two disjoint integer sets $P_j$ and $F_j$ for each attribute $A_j$, in addition to its index
structure. The pass list $P_j$ is defined as
$\{i \mid (j \in I_i^+) \wedge (f_{ij} \text{ accepts unspecified values})\}$. Similarly, the fail list $F_j$ is defined as
$\{i \mid (j \in I_i^-) \wedge (f_{ij} \text{ rejects unspecified values})\}$. Thus, if an arriving article $m$ has its
$A_j$ attribute's value unspecified, we decrement the entry $I[i]$ by one for each $i \in P_j$
and increase $I[i]$ to $d$ for each $i \in F_j$, instead of searching in the index structure of $A_j$.

CPU Efficient Initialization of Array I: Whenever a new article arrives in the
newsgroup management system, we need to initialize each entry $I[i]$ of the array
$I$ to the size of the positive index set of $V_i$. A simple scheme is to explicitly
store (in persistent memory) $|I_i^+|$ for each $V_i$ and initialize each of the $n$ entries
of $I$ every time a new article arrives. However, since the article is expected to
affect only a few newsgroups, initializing all $n$ elements of the array $I$ can be
very inefficient. Below, we present an efficient scheme to find the initial values
in $I$, only for potentially relevant newsgroups.

We number the views in such a way that $|I_j^+| \le |I_k^+|$ for all $1 \le j < k \le n$.
We define $d-1$ numbers $t_1, t_2, \ldots, t_{d-1}$, where $t_j$ is such that $|I_{t_j}^+| \le j < |I_{t_j+1}^+|$,
i.e., these $d-1$ numbers define the transition points for the initial values in the
array $I$. If no such $t_j$ exists for some $j < d$, then $t_l = n + 1$ for all $j \le l < d$.
We create a persistent memory array $T$ of size $d$, where $T[0] = 0$ and $T[i] = t_i$
for $1 \le i < d$. Since the array $T$ contains only $d$ elements, it is quite small and
hence can be maintained in main memory. To find the initial value of $I[i]$, i.e., $|I_i^+|$,
we find a number $x$ such that $T[x-1] < i \le T[x]$ in $O(\log d)$ main-memory
time. It is easy to see that $|I_i^+| = x$.

When we have to decrement the value of $I[i]$ for the first time, we initialize
$I[i]$ to $x - 1$; otherwise we reduce the current value of $I[i]$ by 1. How do we find
out if $I[i]$ has been decremented before or not? We do this by keeping a bit
vector $H$ of size $n$, where $H[i] = 1$ iff $I[i]$ has been decremented before. Both
arrays $H$ and $I$ can reside in main memory, even when thousands of newsgroups
are maintained as materialized views. For each new article that arrives into the
newsgroup management system, only the bit vector $H$ needs to be reset to 0,
which can be done very efficiently in most systems.
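
The lazy initialization scheme can be sketched as follows. The class name `LazyCounters` and the helper `build_transitions` are hypothetical, views are numbered from 1 as in the text, and the transition point $t_j$ is computed here as the number of views whose positive index set has size at most $j$ (so it equals $n$, rather than $n + 1$, when no larger view exists; the binary search gives the same answers either way).

```python
from bisect import bisect_left

def build_transitions(sizes, d):
    """sizes[i-1] = |I_i^+| for views numbered 1..n in non-decreasing order."""
    # T[0] = 0 and T[j] = number of views with positive index set size <= j.
    return [0] + [sum(1 for s in sizes if s <= j) for j in range(1, d)]

class LazyCounters:
    def __init__(self, sizes, d):
        self.T = build_transitions(sizes, d)
        self.n = len(sizes)
        self.I = [0] * (self.n + 1)      # 1-based; stale entries are never read
        self.H = bytearray(self.n + 1)   # H[i] = 1 iff I[i] was set for this article

    def reset(self):
        """Called once per arriving article; only H is cleared."""
        self.H = bytearray(self.n + 1)

    def initial(self, i):
        """Smallest x with T[x-1] < i <= T[x], found by binary search."""
        return bisect_left(self.T, i)

    def decrement(self, i):
        if not self.H[i]:                # first touch: initialize, then decrement
            self.H[i] = 1
            self.I[i] = self.initial(i) - 1
        else:
            self.I[i] -= 1
        return self.I[i]

counters = LazyCounters([1, 1, 2, 3, 3], d=3)
print(counters.T)
print([counters.initial(i) for i in range(1, 6)])
```

Only the entries of views that are actually hit are ever written, so the per-article cost is proportional to the number of hits plus the cost of clearing the bit vector.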

I/O Efficiency: We now analyze the time taken by the above algorithm to output
the newsgroups affected when a new article arrives in the newsgroup management
system.

Let $B$ be the I/O block size. We define $K_j$ as the number of intervals in the
various newsgroup definitions involving an ordered-domain attribute $A_j$
(equivalently, the number of entries in the segment tree $T_j$), and $m_j$ as the maximum
number of intervals in tree $T_j$ that overlap. Using the optimal external segment
tree structure of [RS94], we can perform a search in a segment tree $T$ in
$\log_B(p) + 2(t/B)$ I/O accesses, where $p$ is the number of entries in
$T$, and $t$ is the number of intervals output. Let $S_j$ be the number of similarity-vectors
that have a non-zero weight for the $j$-th word, where $j \le p$. Thus, the scan
of the inverted list $L_j$ takes $S_j/B$ disk accesses. Therefore, the overall query
time complexity of the above algorithm is

$$O\left(\sum_{j=1}^{d-1} \log_B(K_j) + 2\left(\sum_{j=1}^{d-1} m_j + \sum_{j=1}^{p} S_j\right)/B\right),$$

assuming that there are no conditions involving the contains operator. An
unspecified value in the attribute $A_j$ of a new article $m$ results in
only $(|P_j| + |F_j|)/B$ disk accesses.

Theorem 1. Consider $n$ newsgroup views whose definitions are conjunctions
of arithmetic attribute selection and unstructured attribute selection conditions.
Each attribute selection condition on an ordered-domain attribute $A_i$ is a boolean
expression of atomic conditions on $A_i$.

The Independent Search Trees Algorithm for newsgroup-maintenance is correct
and has a maintenance time of

$$O\left(\sum_{j=1}^{d-1} \log_B(K_j) + 2\left(\sum_{j=1}^{d-1} m_j + \sum_{j=1}^{p} S_j\right)/B\right)$$

disk accesses, where $K_j$, $m_j$, and $S_j$ are defined as above.

If an attribute selection condition for a text attribute $A_j$ uses the contains
operator, the maintenance time required to search the index structure (trie) of
$A_j$ is $(|b_j|^2 + m_j)/B$, where $|b_j|$ is the length of $b_j$, the $A_j$ attribute string value
in the arriving article, and $m_j$ is the number of newsgroups that match.

The update time of the data structure due to an insertion of a new view is
$O\left(\sum_{j=1}^{d-1} \log_B(K_j)\right) + s$ disk accesses, where $s$ is the number of non-zero weights
in the typical-article bodies of the added view.^2

One of the main features of the algorithm described above is that its time
complexity depends directly on the total number of specified atomic attribute
selection conditions, unlike any of the previously proposed algorithms for similar
problems.



4 The Newsgroup-Selection Problem 

An important decision in the design of the newsgroup system is to select an
appropriate set of newsgroups to be eagerly maintained (materialized). The rest
of the newsgroups are computed whenever queried, using the other materialized
newsgroups and/or the article store. A natural optimization criterion is to minimize
the storage space and/or newsgroup maintenance time, while guaranteeing
that each newsgroup query can be answered within some threshold.

The query threshold of a newsgroup query is essentially the query time that a user
request for the newsgroup can tolerate. Heavily accessed important newsgroups
would have low query thresholds, while newsgroups with very low query
frequencies could tolerate higher query times. The newsgroup-selection problem is
to select the most “beneficial” newsgroups to materialize, so that all newsgroup
queries can be answered within their respective query-time thresholds. Often,
materializing only a small subset of newsgroups will be sufficient to answer each
newsgroup query within its query-time threshold. This conserves system
resources and facilitates efficient maintenance of materialized newsgroups.

^2 The array T can be maintained periodically.

In this section, we first formulate the general problem of selecting newsgroups
to be eagerly maintained (materialized), and show that it is, unfortunately,
intractable. We then take advantage of our specific model of newsgroup definitions
as selection views, and present some efficient exact/approximation algorithms
and complexity hardness results for the problems.

As with the previous problem of newsgroup-maintenance, the problems ad- 
dressed here are more generally applicable to the selection of views to materialize 
in a data warehouse, when the queries are restricted to selections and unions over 
the underlying sources of information. 

4.1 General Problem of Newsgroup-Selection

Consider a labeled bipartite hypergraph $G = (Q \cup V, E)$, where $Q$ is the set of
newsgroup queries and $V$ is a set of candidate newsgroups (views) considered
for materialization. The set $E$ is the set of hyperedges, where each hyperedge is
of the form $(q, \{v_1, v_2, \ldots, v_l\})$, $q \in Q$, and $v_1, v_2, \ldots, v_l \in V$. Each hyperedge
is labeled with a query-cost of $t$, signifying that query $q$ can be answered using
the set of views $\{v_1, v_2, \ldots, v_l\}$ incurring a cost of $t$ units. With each query node
$q \in Q$, there is a query-cost threshold $T_q$ associated, and with each view node
$v \in V$, there is a weight (space cost) $S(v)$ associated. We refer to such a graph
as a query-view graph. This notion of a query-view graph is similar to that used
in [GHRU97], but more general. We now define the newsgroup-selection problem.

Newsgroup-Selection Problem: Given a bipartite query-view hypergraph $G$
defined as above, select a minimum weighted set of views $M \subseteq V$ to materialize
such that for each query $q \in Q$ there exists a hyperedge $(q, \{v_1, v_2, \ldots, v_l\})$ in $G$,
where views $v_1, v_2, \ldots, v_l \in M$ and the query-cost associated with the hyperedge
is less than $T_q$.
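
To make the problem statement concrete, the following Python sketch checks whether a candidate set $M$ answers every query and, for tiny instances only, finds a minimum-weight selection by exhaustive search. The function names, the tuple encoding of hyperedges, and the example instance are hypothetical; the exhaustive search is exponential in the number of views and is shown purely to pin down the definition, not as a practical algorithm.

```python
from itertools import combinations

def answers_all(hyperedges, thresholds, materialized):
    """hyperedges: (query, frozenset of views, cost) triples.

    A query is answered if some hyperedge uses only materialized views
    and has cost strictly below the query's threshold T_q."""
    answered = {q for q, views, cost in hyperedges
                if cost < thresholds[q] and views <= materialized}
    return answered == set(thresholds)

def min_weight_selection(hyperedges, thresholds, weights):
    """Exhaustive search over all subsets of views (tiny instances only)."""
    best = None
    names = sorted(weights)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            chosen = frozenset(combo)
            if answers_all(hyperedges, thresholds, chosen):
                total = sum(weights[v] for v in chosen)
                if best is None or total < best[0]:
                    best = (total, chosen)
    return best

hyperedges = [
    ("q1", frozenset({"a"}), 1),
    ("q1", frozenset({"b", "c"}), 2),
    ("q2", frozenset({"c"}), 1),
    ("q2", frozenset({"a"}), 5),
]
thresholds = {"q1": 3, "q2": 3}
weights = {"a": 3, "b": 1, "c": 1}
print(min_weight_selection(hyperedges, thresholds, weights))
```

In this instance the edge from q2 to view a is too expensive, so view c must be chosen, and completing the selection with b is cheaper than adding a.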

The above problem is trivially in NP. As there is a straightforward reduction
from minimum set cover to a special case of the newsgroup-selection problem
when $G$ has only simple edges, the newsgroup-selection problem is also NP-hard.
The newsgroup-selection problem is exactly the problem of minimizing the number
of leaves scheduled in a 3-level AND/OR scheduling problem with internal-tree
precedence constraints [GM97]. The 2-level version of the AND/OR scheduling
problem with internal-tree precedence constraints is equivalent to minimum
set cover, while the 4-level AND/OR scheduling problem with internal-tree
constraints is as hard as the LABEL-COVER [ABSS93, GM97] problem, making it
quasi-NP-hard^3 to approximate within a factor of $2^{\log^{1-\gamma} n}$ for any $\gamma > 0$. To
the best of our knowledge, nothing is known about the 3-level version of the
AND/OR scheduling problem with internal-tree constraints.

The intractability of the general problem leads us to look at some natural 
special cases that arise in practice in the context of newsgroup management, 

^3 That is, this would imply $NP \subseteq DTIME(n^{\mathrm{poly}(\log n)})$. “A proof of quasi-NP-hardness is
good evidence that the problem has no polynomial-time algorithm” wm- 
and we present efficient algorithms for each of them. Recall that, in Section 2,
we allowed newsgroups to be defined only as selection views of a specified form.
In such cases, a newsgroup needs only the union ($\cup$) and selection ($\sigma$) relational
operators to be computed from a set of other newsgroups. However, the above
formulation of the newsgroup-selection problem is much more general.

In the next subsection, we restrict the computation of a newsgroup from 
other newsgroups to just using the selection operator. We handle the case of 
using both the union and the selection operators in the subsequent subsection. 

4.2 Queries as Selections over Views 

In this subsection, we focus on the restricted newsgroup-selection problem where
newsgroup queries are computed using only selections, either on some materialized
newsgroup in the data warehouse or the article store. For example, if the
data warehouse eagerly maintains comp.databases, then answering the newsgroup
query comp.databases.object requires only a selection over the newsgroup
comp.databases. Being a special case of the general newsgroup-selection
problem, the above restricted version of the problem helps in better understanding
the general problem. Moreover, in some domains, the union operation may
be very expensive or not feasible. For example, let the newsgroup $V_1$ contain all
articles whose Subject contains “computer”, and the newsgroup $V_2$ be the set
of all articles whose Subject contains “compute”. Though $V_2$ can be computed
using $V_1$ and the article store, by applying the selection condition ($\wedge$ (Subject contains
“compute”) ($\neg$(Subject contains “computer”))) on the article store, it may be
more efficient to compute $V_2$ using a simple selection over just the article store.

In the above restricted newsgroup-selection problem where newsgroup queries
are computed using only the selection operator, a query uses exactly one view
for its computation. Hence, the query-view graph defined earlier will have only
simple edges. The restricted newsgroup-selection problem has a natural reduction from
the minimum set cover problem and hence is also NP-complete. However, there
exists a polynomial-time greedy algorithm that delivers a competitive solution
that is within an $O(\log n)$ factor of an optimal solution, where $n$ is the number of
newsgroup queries. The greedy algorithm used is almost the same as that used
to approximate the weighted set cover problem.
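
The standard greedy heuristic for weighted set cover, on which the algorithm mentioned above is based, can be sketched as follows. The sketch assumes the simple-edge setting: for each candidate view we are given the set of queries it can answer within their thresholds. The function name `greedy_select` and the example instance are hypothetical, and this is the textbook heuristic rather than the exact variant used in the paper.

```python
def greedy_select(candidates, weights, queries):
    """candidates[v] = set of queries that view v answers within threshold.

    Repeatedly pick the view with the smallest weight per newly
    covered query until every query is covered."""
    uncovered = set(queries)
    chosen = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for v, answered in candidates.items():
            gain = len(answered & uncovered)
            if gain and weights[v] / gain < best_ratio:
                best, best_ratio = v, weights[v] / gain
        if best is None:
            raise ValueError("some query cannot be answered by any view")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

candidates = {"a": {1, 2, 3}, "b": {1, 2}, "c": {3}, "d": {4}}
weights = {"a": 5, "b": 1, "c": 1, "d": 2}
print(greedy_select(candidates, weights, {1, 2, 3, 4}))
```

Here the greedy choice avoids the heavy view a even though it covers the most queries, because b and c together cover the same queries at lower total weight.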

So far, we have not taken any advantage of the specific nature of the atomic 
conditions used in the newsgroup definitions. We now do so, and restrict ourselves 
to newsgroups defined using arithmetic operators on the article attributes. As 
the arithmetic operators used in an atomic condition assume an order on the 
domain, all defined newsgroups form some sort of “orthogonal objects” in the 
multidimensional space of the article attributes. We take advantage of this fact 
and formulate a series of problems, presenting exact or approximate algorithms. 

Newsgroup-Selection with Ordered Domains. Consider an ordered domain
$D$. Consider newsgroups (views or queries) that are ranges over $D$. In other
words, views and queries can be represented as intervals over $D$. As we restrict
our attention to using only the selection operator for computing a query, a query
interval $q$ can be computed using a view interval $v$ only if $v$ completely covers
$q$. With each pair $(q, v)$, where $v$ completely covers $q$, there is a query-cost
associated, which is the cost incurred in computing $q$ from $v$.

We observe here that the techniques used to solve the various problems of
one-dimensional queries/views addressed in this section can also be applied to
the more general case when the queries involve intervals (ranges) along one
dimension and equality selections over other dimensions.

Problem (One-dimensional Selection Queries) Given interval views $V$ and
interval queries $Q$ over an ordered domain $D$, select a minimum weighted set of
interval views $M$ such that each query $q \in Q$ has a view $v \in M$ that completely
contains $q$ and answers the query $q$ within its query-cost threshold $T_q$.

Let n be the number of query and view intervals. There is an O(n^) exact 
dynamic programming algorithm that delivers an optimal solution to the above 
problem. The algorithm appears in the full version of the paper. 

The restricted version of the newsgroup-selection problem considered above is
a special case of the view-selection problem in OR view graphs defined in [Gup97]
with different optimization criteria and constraints. Gupta [Gup97] presents a
simple greedy approach to deliver a solution that is within a constant factor of
an optimal solution. In effect, we have taken advantage of the restricted model
of the newsgroup definitions and shown that for this special case of the
view-selection problem in OR graphs there exists a polynomial-time algorithm that
delivers an optimal solution.

Multi-dimension Selection Queries. For the generalization of the above
newsgroup-selection problem to d-dimensional selection queries, where each query and view
is a d-dimensional hyper-rectangle, no approximation algorithm better than the
$O(\log n)$ one is known. The newsgroup-selection problem for d-dimensional
selection queries can be shown to be NP-complete through a reduction from
3-SAT. In fact, the problem is a more general version of the classical
problem of covering points using rectangles in a 2-D plane [FPT81], for which
nothing better than an $O(\log n)$ approximation algorithm is known.

Selection over Body Attribute Conditions. Consider newsgroups defined
by selection conditions of the form $f_{id} = \bigvee_k C(A_d, B_{ik}, T_{ik})$ over the Body
attribute of the articles. So, an article $m$ belongs to a newsgroup if $m$ is similar
to one of the representative typical-article bodies $B_{ik}$ with a minimum threshold.
A newsgroup query $Q$ can be answered using another materialized newsgroup
view $V$ if the set of typical-article bodies of $Q$ is a subset of the set of
typical-article bodies of $V$. The newsgroup-selection problem in this setting can be
shown to be exactly the same problem as the NP-complete set cover problem.
Thus, allowing selections over the Body attribute makes the newsgroup-selection
problem as difficult as the general newsgroup-selection problem with simple edges
in the query-view graph.

4.3 Queries as Selections + Unions over Views

In this section, we look at some special cases of the general newsgroup-selection 
problem, while allowing both selection and union operators for computation of a 
newsgroup query from other materialized newsgroups. The use of both operators 
introduces hyperedges in the query-view graph. As mentioned before, the gen- 
eral newsgroup-selection problem involving hyperedges is intractable, hence we 
take advantage of the restricted model of our newsgroup definitions in designing 
approximation algorithms. 

The newsgroup-selection problem with union and selection operators is a 
special case of the view-selection problem in AND-OR view graphs considered 
in [Gup97] with different optimization criteria and constraints. Gupta [Gup97] 
fails to give any approximation algorithms for the general view-selection problem 
in AND-OR graphs. We take advantage of the special nature of our problem, 
and present some polynomial-time approximation algorithms. 



One-dimensional Selection/Union Queries Consider an ordered domain 
D, and let newsgroups be interval ranges over D. In other words, newsgroup 
views and queries can be represented as intervals over D, and a newsgroup query 
interval can be answered using views $v_1, \ldots, v_l$ if the union of the view intervals 
covers the query interval completely. There is a query-cost associated with each 
such pair, which is the cost incurred in computing q from $v_1, v_2, \ldots, v_l$. 

Problem (One-dimensional Selection/Union Queries) Given a set of interval 
views V and interval queries Q in an ordered domain D, select a minimum 
weighted set of interval views M such that each query $q \in Q$ has a set of views 
$v_1, v_2, \ldots, v_l \in M$ that completely cover q and the query-cost associated is less 
than $T_q$. 

Consider the following cost model. In addition to a weight associated with 
each view, let there also be a cost C(v) associated with each view. Let the cost 
of computing a query q using a set of views $\{v_1, v_2, \ldots, v_l\}$ that cover the query 
q be defined as $\sum_{i=1}^{l} C(v_i)$, i.e., the sum of the costs of the views used. The 
above cost model is general enough for all practical purposes. 
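
As an illustration of the coverage condition and the cost model above, the following Python sketch checks whether a set of view intervals covers a query interval and sums the per-view costs. It is only a sketch under stated assumptions: intervals are closed, numeric, and satisfy lo < hi, and the names `covers` and `query_cost` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # assumed closed interval [lo, hi] with lo < hi


def covers(query: Interval, views: List[Interval]) -> bool:
    """Return True if the union of the view intervals covers the query interval."""
    q_lo, q_hi = query
    reach = q_lo  # everything in [q_lo, reach] is known to be covered
    for lo, hi in sorted(views):
        if lo > reach:
            # Views are sorted by left endpoint, so this gap can never be filled.
            return False
        reach = max(reach, hi)
        if reach >= q_hi:
            return True
    return False


def query_cost(used_views: List[str], cost: Dict[str, float]) -> float:
    """General cost model: the sum of C(v) over the views used for the query."""
    return sum(cost[v] for v in used_views)
```

For example, the views [0, 3] and [3, 6] together cover the query [1, 5], while [0, 2] and [3, 6] leave a gap and do not.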

The problem of one-dimensional selection/union queries with the above cost 
model can be shown to be NP-complete through a reduction from the NP-complete 
Partition [GJ79] problem. See the full version of the paper for the reduction. 
However, if we restrict our attention to the index cost model where the cost 
incurred in computing a query covered by l views is l units (a special case of the 
general cost model above, where each C(v) = 1), we show that there exists a 
dynamic programming solution whose running time is bounded in terms of m, the 
maximum overlap between the given queries, and k, the maximum individual 
query-cost threshold (which we expect to be small). The details of the algorithm 
can be found in the full version of the paper. 

The index cost model, where the cost of answering a query q using l views 
is l units, is based on the following very reasonable implementation. If all the 
materialized views are indexed along the dimension D, then the cost incurred 
in computing the query is proportional to l, the total number of index look-ups. 
Note that the query-costs associated with the edges may not be the actual query 
costs but could be the normalized query-cost "overheads". 
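
Under the index cost model, the cost of a query is the number of views used to cover it, so the cheapest way to answer a single query from a fixed set of materialized views is a minimum-cardinality interval cover. The sketch below uses the standard greedy interval-covering rule for that subproblem; it is not the dynamic program from the full version of the paper. The function names, the closed-interval assumption, and the strict comparison against the threshold are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # assumed closed interval [lo, hi] with lo < hi


def min_views_to_cover(query: Interval, views: List[Interval]) -> Optional[int]:
    """Smallest number of views whose union covers the query, or None if impossible."""
    q_lo, q_hi = query
    reach, used = q_lo, 0
    while reach < q_hi:
        # Among the views that start at or before the covered prefix,
        # take the one that extends furthest to the right.
        best = max((hi for lo, hi in views if lo <= reach), default=reach)
        if best <= reach:
            return None  # no view extends the covered prefix: there is a gap
        reach = best
        used += 1
    return used


def within_threshold(query: Interval, views: List[Interval], t_q: int) -> bool:
    """Check that the index cost of the query is less than its threshold t_q."""
    used = min_views_to_cover(query, views)
    return used is not None and used < t_q
```

With the views [0, 4], [3, 7], [6, 10] and [0, 7], the query [0, 10] needs two index look-ups (using [0, 7] and [6, 10]).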

Average Query Cost Constraint A relatively easier problem in the context of 
the above cost model arises when the constraint is on the total (or average) query 
cost instead of on each individual query. For such a case, there exists a dynamic 
programming algorithm, with running time polynomial in k and n, that delivers a 
minimum-weighted solution, where k (< n) is the average query-cost constraint. 
The dynamic programming approach works by maintaining for each interval [1, i] 
a list of k solutions, where the j-th solution corresponds to the minimum-weighted 
set of views that covers the queries in [1, i] under the constraint that the total 
query cost incurred is less than j. 

Multi-dimensional Selection/Union Queries Consider next the newsgroup- 
selection problem where newsgroup queries and views are d-dimensional ranges. 
In other words, views and queries can be represented as hyper-rectangles in a d-
dimensional space and a query hyper-rectangle can be answered using views 
$v_1, \ldots, v_k$ if the union of the view hyper-rectangles covers the query hyper-
rectangle completely. We wish to select a minimum-weighted set of views such 
that all queries are covered. The simplest version of the problem has no thresh- 
old constraints and it is only required to cover all the query rectangles using the 
materialized views. 

The above problem is NP-complete even for the case of two dimensions. We 
present here a polynomial-time (in n) O(d log n) approximation algorithm. The 
space of hyper-rectangular queries can be broken down into $O((2n)^d)$ elementary 
hyper-rectangles. Thus, the problem of covering the query hyper-rectangles 
can be reduced to covering the elementary hyper-rectangles with a minimum-
weighted set of views, which is equivalent to a weighted set cover instance 
having $O((2n)^d)$ elements. Since the greedy set cover ratio is logarithmic in the 
number of elements and $\log (2n)^d = d \log 2n$, this has an O(d log n) 
approximation algorithm. 
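
The reduction above can be sketched in Python as follows. The sketch cuts each dimension at every endpoint that occurs in a query or view, treats the resulting grid cells that lie inside some query as the elements to cover, and then runs the standard greedy weighted set cover heuristic (lowest weight per newly covered cell). The box representation, the function names, and the tie-breaking are assumptions made for illustration; the paper gives no code.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (lo, hi) pair per dimension, lo < hi
Cell = Tuple[int, ...]                 # index of an elementary cell in the grid


def cells_inside(box: Box, grid: List[List[float]]) -> Set[Cell]:
    """Index tuples of the elementary cells fully contained in `box`."""
    per_dim = []
    for (lo, hi), coords in zip(box, grid):
        per_dim.append([i for i in range(len(coords) - 1)
                        if lo <= coords[i] and coords[i + 1] <= hi])
    return set(product(*per_dim))


def greedy_view_cover(queries: List[Box], views: Dict[str, Box],
                      weights: Dict[str, float]) -> List[str]:
    """Greedy weighted set cover over the elementary cells of the queries."""
    d = len(queries[0])
    boxes = list(queries) + list(views.values())
    # Cut each dimension at every endpoint that appears in a query or a view.
    grid = [sorted({c for box in boxes for c in box[k]}) for k in range(d)]
    universe: Set[Cell] = set()
    for q in queries:
        universe |= cells_inside(q, grid)
    covers = {name: cells_inside(v, grid) & universe for name, v in views.items()}

    chosen: List[str] = []
    uncovered = set(universe)
    while uncovered:
        useful = [n for n in covers if covers[n] & uncovered]
        if not useful:
            raise ValueError("some query region is not covered by any view")
        # Pick the view with the lowest weight per newly covered cell.
        best = min(useful, key=lambda n: weights[n] / len(covers[n] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen
```

Because every box boundary is a grid line, each elementary cell is either entirely inside a view or has an interior disjoint from it, which is what makes the set cover formulation exact.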



Selection/Union on Body Attribute Conditions In this subsection, we con-
sider the case where newsgroup queries, having selection conditions of the form 
$f_{id} = \bigvee_k C(A_d, B_{ik}, T_{ik})$, can be computed using only selection and union over 
the Body attribute predicates of the materialized newsgroup views. Due to the 
same similarity-vector occurring in the definition of many newsgroups, a news-
group V can be computed from the materialized newsgroups $V_1, V_2, \ldots, V_l$ if each 
similarity-vector of V is included in one of the materialized newsgroups.^ The 
computation involves computing a selection over each of the relevant newsgroups 
followed by a union. If some of the similarity-vectors of a non-materialized view 
V are not covered by the other materialized views, then to answer the newsgroup 
query V, the article store needs to be accessed to select all articles whose bodies 

^ Here, for simplicity, we assume that the threshold corresponding to a particular 
similarity-vector is the same across different newsgroup views it is used in. 
match (using cosine similarity function with the specified threshold) one of the 
uncovered similarity-vectors. 

The cost of accessing a materialized view V, based on a selection over ordered 
domains, is proportional to the logarithm of the size of V, as it involves searching 
through efficient data structures like B-trees. In contrast, accessing V based on a 
selection over an unstructured attribute involves searching through the inverted 
index data structure. Therefore, when a query is computed as selection and 
union over the Body attribute using some materialized views and the article 
store, the cost incurred in accessing the article store is the dominant factor in 
the total query time. Since each similarity-vector is compared with the inverted 
index independently, the query cost is proportional to the number of similarity- 
vectors sent to the article store. 

Thus, in the context of newsgroup queries computed as selection/union over 
the Body attribute, the natural optimization problem is to select a minimum 
weighted set of newsgroup views to materialize, such that the total number of 
uncovered similarity-vectors summed over all newsgroup queries is less than a 
given threshold. From the discussion in the previous paragraph, the threshold 
on the total number of similarity-vectors summed over all queries translates to 
a threshold on the average query-cost of a newsgroup query. 

The above optimization problem has a greedy O(log n) approximation algorithm 
that at each stage selects the newsgroup that decreases the total number 
of uncovered body articles (summed over all queries) by the most. We omit the 
details of the proof here. 
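
A minimal sketch of such a greedy selection is given below, assuming each newsgroup is represented simply as a set of similarity-vector identifiers. At each stage it materializes the newsgroup that most reduces the total number of uncovered similarity-vectors summed over all queries, and it stops once that total drops below the threshold. The data layout, the function name, and the rule of breaking ties by lower weight are assumptions for illustration; the approximation guarantee is the paper's claim and is not established by this code.

```python
from typing import Dict, List, Set, Tuple


def greedy_materialize(newsgroups: Dict[str, Set[int]], queries: List[str],
                       weights: Dict[str, float],
                       threshold: int) -> Tuple[List[str], int]:
    """Pick newsgroups to materialize until the uncovered total is below threshold."""

    def uncovered_total(covered: Set[int]) -> int:
        # Similarity-vectors of each query not found in any materialized newsgroup.
        return sum(len(newsgroups[q] - covered) for q in queries)

    covered: Set[int] = set()
    chosen: List[str] = []
    total = uncovered_total(covered)
    while total >= threshold:
        best, best_total = None, total
        for name, vectors in newsgroups.items():
            if name in chosen:
                continue
            t = uncovered_total(covered | vectors)
            better = t < best_total
            tie = (t == best_total and best is not None
                   and weights[name] < weights[best])
            if better or tie:
                best, best_total = name, t
        if best is None:
            raise ValueError("threshold cannot be met with the given newsgroups")
        chosen.append(best)
        covered |= newsgroups[best]
        total = best_total
    return chosen, total
```

A common variant divides the decrease by the weight of the candidate newsgroup; the version here follows the selection rule as stated in the text.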

4.4 Related Work 

The newsgroup-selection problem is similar to the view-selection problem defined 
in [Gup97]. The view-selection problem considered there was to select a set of 
views for materialization to minimize the query response time under the disk- 
space constraint. The key differences between the two problems are the different 
constraint and the minimization goal used. 

Previous work on the view selection problem is as follows. Harinarayan et 
al. [HRU96] provide algorithms to select views to materialize for the case of data 
cubes, which is a special case of OR-graphs, where a query uses exactly one view 
to compute itself. The authors in [HRU96] show that the proposed polynomial-
time greedy algorithm delivers a solution that is within a constant factor of the 
optimal solution. Gupta et al. extend their results to selection of views 
and indexes in data cubes. Gupta [Gup97] presents a theoretical formulation 
of the general view-selection problem in a data warehouse and generalizes the 
previous results to general OR view graphs, AND view graphs, OR view graphs 
with indexes, and AND view graphs with indexes. 

The case of one-dimensional selection queries considered here is a special 
case of the view-selection problem in OR view graphs for which we provided 
a polynomial time algorithm that delivers an optimal solution. Similarly, the 
case of one-dimensional selection/union queries is a special case of the view-
selection problem in AND-OR view graphs ([Gup97]), which we observe can be 
solved optimally in polynomial time for a reasonable cost model. The case of 
selection/union queries on newsgroups defined using similarity-vectors is also a 
special case of the general view-selection problem in AND-OR graphs, for which 
we have designed a provably good approximation algorithm. 

In the computational geometry research community, to the best of our knowledge, 
the specific problems mentioned here have not been addressed except for the 
preliminary work done on rectangular covers. 

5 Conclusions 

We have proposed a novel paradigm for newsgroups, the DaWN model, where 
newsgroups are defined as selection views over the article store. The success of 
the DaWN model clearly depends on the efficiency with which news readers 
can continue to access newsgroups. In this paper, we have looked at the two 
complementary problems that are critical for this efficiency. The first problem 
is the efficient incremental maintenance of eagerly maintained newsgroups. We 
have designed an I/O and CPU efficient algorithm for this problem, based on 
external segment trees, tries, and inverted lists. The second problem is the choice 
of eagerly maintained (materialized) newsgroups. We have demonstrated the 
intractability of the general problem, and discussed various special natural cases 
of the general problem in the context of the DaWN model. 

The success of the DaWN model also depends on the precision with which an 
article can be automatically classified into appropriate newsgroups. This preci- 
sion will be determined by the newsgroup definitions, in particular by the choice 
of representative typical-articles. The problem of a good choice of typical-articles 
for a given newsgroup is orthogonal to the problems/issues addressed in this pa- 
per, and is an interesting open problem, where techniques from data mining and 
data clustering can play a significant role. 

We believe that the DaWN model can also serve as the foundation for al- 
lowing individual users to enhance the standard newsgroups by defining their 
personal newsgroups. Such personal newsgroups can be specified using “pro- 
files” of the users that are matched against the article store, and techniques and 
solutions from dissemination-based systems can be used to advantage here. 